Sum of transition probabilities for finite sequence HMM

The usual constraint on transition probabilities in Hidden Markov Models is

\sum_{j=1}^N a_{ij} =1

for all states 1<=i<=N.

However, when the model is used as a generator, this makes it impossible to exit the model naturally, and some arbitrary length has to be imposed on the generated sequences from outside.

I feel that a more appropriate constraint would be

\sum_{j=1}^N a_{ij} < 1

for those states i that can be the last state visited. This matches the interpretation of a_{ij} as the probability of visiting state j from state i: in a finite sequence, no further state is reached once we are done, so the sum does not have to reach 1. The model can then be exited naturally from states that keep some probability for leaving, which effectively generates variable-length finite sequences.
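For illustration, here is a toy generator along these lines (the states, words and numbers are made up, just to show the effect of rows summing to less than 1):

```python
import random

# Toy two-state generator. Transition rows may sum to less than 1;
# the missing mass is the probability of leaving the model (ending the sequence).
start_prob = {"NN": 1.0, "VB": 0.0}
trans_prob = {"NN": {"NN": 0.0, "VB": 0.8},   # row sums to 0.8 -> exit prob 0.2
              "VB": {"NN": 0.0, "VB": 0.0}}   # row sums to 0.0 -> always exits
emit_prob  = {"NN": {"John": 1.0},
              "VB": {"writes": 0.5, "reads": 0.5}}

def sample(dist):
    """Sample a key from a probability dict; returns None when the
    probabilities sum to less than 1 and the leftover mass is drawn."""
    r, total = random.random(), 0.0
    for key, p in dist.items():
        total += p
        if r < total:
            return key
    return None

def generate():
    words, state = [], sample(start_prob)
    while state is not None:
        words.append(sample(emit_prob[state]))
        state = sample(trans_prob[state])   # None means a natural exit
    return " ".join(words)

print(generate())   # e.g. "John", "John writes", "John reads"
```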

Has this approach already been developed?

Hi @Jaume_Oliver_Lafont

The “constraint” on transition probabilities has nothing to do with “exiting the model”. The probabilities must sum to 1, so if you wanted to model the “exit” probability, you could create the exit transition. For example, a->b=0.25, a->c=0.25, a->exit=0.5.
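For instance, written out as a transition table (the b and c rows are made up just to complete the example), every row still sums to 1 and “exit” is absorbing:

```python
# Explicit "exit" state: rows still sum to 1, and once "exit" is reached
# the sequence is over (it only transitions to itself).
trans_prob = {
    "a":    {"b": 0.25, "c": 0.25, "exit": 0.5},
    "b":    {"a": 1.0},
    "c":    {"a": 1.0},
    "exit": {"exit": 1.0},
}
for state, row in trans_prob.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, state
```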

But again, in the Assignment, the POS transition probabilities sum to 1 for every state, regardless of whether it is the first, the last, or any other state.

Or did I not understand your post correctly?

I understand that the constraint makes exiting the model impossible. Once we enter the model, we move to another emitting state with probability 1! Therefore all sequences are going to be infinite unless they are truncated arbitrarily from outside the model. I see that your exit state with the additional transition(s) is an equivalent way to solve the problem of looping forever among the symbol-emitting states.

Not yet in the assignment, still watching videos.

Thank you for the answer!

I’m not sure you understand that “exiting” is not the problem. We are not trying to navigate anywhere.
What we are trying to do is model the “invisible” (hidden) state - the Part Of Speech tag - we only see words, and we try to “imagine” what POS each word could be. There is no concept of exiting in our case/problem.

Anyway, just wanted to make this point.

But sentences are finite. Often we represent exiting with the full stop.

Just as the model is entered at states according to their initial probabilities, by symmetry there has to be some exit point if we are to model finite-length sequences such as sentences.

The easiest way to notice the problem is in the last state of a left-right model. The average time spent in a state is defined as 1/(1-a_{ii}). So in a left-right model with a_{NN} = 1, the last state would have an infinite duration; it is inescapable.
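(For reference, this is just the mean of the geometric distribution of the state duration:

\bar{d}_i = \sum_{d=1}^{\infty} d \, a_{ii}^{d-1} (1 - a_{ii}) = \frac{1}{1 - a_{ii}}

which diverges as a_{ii} approaches 1.)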

Yes, but what about paragraphs, chapters, books, series, etc.? They could also be modeled.
The main point I want to make is that we can model whatever we want. And in our particular case we model POS (and the “full stop”, “.”, is included). If you look at what POS tags we have, there are many (46) states:

Number of POS tags (number of 'states'): 46
View these POS tags (states)
['#', '$', "''", '(', ')', ',', '--s--', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 
'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 
'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 
'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 
'WDT', 'WP', 'WP$', 'WRB', '``']

You can see the ‘.’ tag after the ‘–s–’ tag in this list. If we wanted, we could have included an ‘–end–’ state.

And that ‘-end-’ state would have to violate the constraint somehow, because there is no next state after it, right?

Ok, I’ll think again about this after the assignment.

As a last attempt to get my point across, consider a two-state model for the following two classes of “sentences”:

“sign+one digit”: 0, -1, +4, 6, 7…

“integer”: +345, -12, 765, 67…

The only difference between these models is the self-loop probability of the digit state, which is 0 for the one-digit numbers but greater than 0 (and, from my viewpoint, less than 1) for the multi-digit integers.
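A rough sketch of the generator I have in mind (toy probabilities, only for illustration):

```python
import random

def generate_number(self_loop):
    """Toy two-state (sign, digit) generator. The digit state loops back on
    itself with probability `self_loop` and exits with probability 1 - self_loop."""
    out = random.choice(["", "+", "-"])       # optional sign state
    out += random.choice("0123456789")        # first visit to the digit state
    while random.random() < self_loop:        # self-loop: emit another digit
        out += random.choice("0123456789")
    return out                                # exit with probability 1 - self_loop

print(generate_number(0.0))   # "sign + one digit" model: e.g. -7, 4, +0
print(generate_number(0.6))   # "integer" model: e.g. +345, 67, -12
```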

Thanks again!

Correct. But then you have to model it: what are the transition probabilities from the ‘-end-’ state?

Of course there are absorbing Markov chains and other ways to model a terminal state, but I just wanted you to understand what we are trying to do in this week’s assignment. HMMs are a branch of science on their own and there is a lot to learn about them.

So your ideas are good and welcome, but I wanted you to understand that it’s not a “limiting constraint” that somehow makes this implementation wrong. :slight_smile:

Hello again.

Finished the assignment. I don’t mean there is a mistake with this particular implementation, but with the general interpretation of probabilities in this kind of sequence modeling.

Take a model with two states, noun (NN) and verb (VB).

If we train it with the labeled sentence “John sleeps”, the probability of this sequence is clearly 1. But there happens to be another sequence that also has probability 1, which is “John”. However, “John” as a complete sentence was not in the training data. When we evaluate prob(“John”), it means the probability that the sequence starts with “John”, not that it is exactly “John”.

Let us take a larger dataset to train the HMM: “John”, “John writes”, “John reads”, “John walks”, “John runs”. Given this dataset, each of these five sentences should have probability 1/5. However, the standard training will give
p(“John”) = 1
p(“John writes”) = 1/4
p(“John walks”) = 1/4
p(“John reads”) = 1/4
p(“John runs”) = 1/4

Again, the sum of the probabilities of all sentences is larger than 1, which means the sentences are not treated as mutually exclusive outcomes; “John” is only taken as a sequence start that is always followed by a verb.

A better model of finite sequences is achieved by relaxing the constraint \sum_j a_{ij} = 1, which is better suited to semi-infinite sequences.

In this five-sentence set, prob(NN->VB) should be 0.8 instead of 1. To evaluate the probability of the sentence “John” we should multiply three numbers instead of only two: the probability of starting in state NN, the probability of emitting “John” in state NN, and the probability of exiting after visiting tag NN, which is 1 - 0.8 = 0.2. This exit probability is not considered when the usual constraint is applied, which makes it impossible to have prob(“John”) < prob(“John” + V).
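Putting numbers on it (a small check using the initial and emission probabilities this toy corpus gives):

```python
# Toy corpus: "John", "John writes", "John reads", "John walks", "John runs".
p_init_NN = 1.0        # every sentence starts with the tag NN
b_NN_john = 1.0        # NN always emits "John"
b_VB_verb = 1.0 / 4    # VB emits each of the four verbs with probability 1/4

# Standard training ignores sentence ends: NN is followed by VB in 4 of the
# 4 transitions observed, so a(NN->VB) = 1.
p_john_std  = p_init_NN * b_NN_john                        # = 1
p_johnV_std = p_init_NN * b_NN_john * 1.0 * b_VB_verb      # = 0.25 each

# Counting the end of a sentence as an outcome of NN: NN -> VB 4 times and
# NN -> exit 1 time, so a(NN->VB) = 0.8 and a(NN->exit) = 0.2; VB always exits.
p_john_exit  = p_init_NN * b_NN_john * 0.2                     # = 0.2
p_johnV_exit = p_init_NN * b_NN_john * 0.8 * b_VB_verb * 1.0   # = 0.2 each

print(p_john_std + 4 * p_johnV_std)     # 2.0 -- more than 1
print(p_john_exit + 4 * p_johnV_exit)   # 1.0 -- a proper distribution over the 5 sentences
```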

I don’t think we should model the transition probabilities from the “-end-” state, just as we do not care about the transition probabilities into the “-start-” state.

Hi @Jaume_Oliver_Lafont

I think you are misunderstanding a couple of things. First, we do not model separate sentence probabilities. Our whole corpus is one big sequence (of hidden POS tags and visible tokens).

POS transition probabilities sum to 1 (each row sums to 1; there is always a t_i following t_{i-1}):

For example,

  • there is a 0.93 chance that “.” (full stop) transitions to “–s–” (start of sentence).
  • there is a 0.22 chance that “–s–” (start of sentence) transitions to “DT” (a determiner, like “The”).

Emission probabilities also sum to 1 (here transposed for better visualization below*: column values sum to 1, and each token w_i can be emitted by some tag t_i; * note that in the assignment this matrix is not transposed):

This table has 46 POS tags, each of which emits one or more of the 23777 tokens with a certain probability.
For example,

  • Token “–n–” (new line in the corpus) is always (1.00) emitted by the “–s–” POS tag (start of sentence).
  • Token “.” is almost always (0.99) emitted by the “.” POS tag.
  • Token “(” was very often (0.83) emitted by the “(” POS tag.
  • Token “{” was fairly often (0.15) emitted by the “(” POS tag.
  • Token “your” was emitted by “PRP$” with probability 0.02.

And these probabilities are used this way on the whole sequence (len(training_corpus) -> 989860), including “.” and “–s–”. Here there are just 3 POS tags, but in the Assignment there are 46 of them:
[image: worked example applying the transition and emission probabilities along a short sequence with 3 POS tags]
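Roughly, the idea looks like this (a toy sketch with a tiny hand-made tagged sequence; this is not the assignment’s actual code, notation, or smoothing):

```python
from collections import defaultdict

# One long tagged sequence standing in for the whole corpus.
corpus = [("--s--", "--n--"), ("DT", "The"), ("NN", "dog"), (".", "."),
          ("--s--", "--n--"), ("DT", "A"), ("NN", "cat"), (".", ".")]

trans_counts = defaultdict(lambda: defaultdict(int))
emit_counts  = defaultdict(lambda: defaultdict(int))
for i, (tag, word) in enumerate(corpus):
    emit_counts[tag][word] += 1
    if i > 0:
        trans_counts[corpus[i - 1][0]][tag] += 1

# Row-normalise the counts: A[prev][t] = C(prev, t) / C(prev, *),
#                           B[t][w]    = C(t, w)    / C(t, *)
A = {prev: {t: c / sum(row.values()) for t, c in row.items()}
     for prev, row in trans_counts.items()}
B = {tag: {w: c / sum(row.values()) for w, c in row.items()}
     for tag, row in emit_counts.items()}

# Using them: multiply a transition and an emission term at every step.
prob = B[corpus[0][0]][corpus[0][1]]
for i in range(1, len(corpus)):
    prev_tag = corpus[i - 1][0]
    tag, word = corpus[i]
    prob *= A[prev_tag][tag] * B[tag][word]
print(prob)   # probability of this particular tagged sequence under A and B
```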

Cheers

Ah! One single sequence… I understand this reduces the states with no next state to only the last one in the corpus (the single t_i with no t_{i+1}), so possibly this has very little practical relevance.

I found my concern about the start and end of sentences addressed in next week’s videos with the <s> and <e> tags.

Thanks for your explanation!