How to get the last token's log probabilities

Hello, I have a hard time understanding how to get the last token's log probabilities. From the assignment hints (as I have posted below), I understand that the 3 in [0, 3, :] is the index for token 5. However, I don't quite get the meaning of the 0. What does this 0 stand for? Does it have something to do with the first dimension of the log probabilities, the batch size?

Also, I am guessing from the hints, where it says the size of the log probabilities is (batch size, decoder length, vocab size), that the decoder length is the number of words in the sentence, and the vocab size is the dimension of the vector we use to represent each word?

I would much appreciate it if someone could help.

Hi @Fei_Li

To clarify:

  • first dimension is the batch dimension (how many batches or "sentences")
  • second dimension is the decoder length - how many words (note that in this case the decoder predicts all the words, including padding)
  • third dimension is the vocab size - what the probabilities are for each word.

What is asked of you is to get the next word prediction - for that you calculate the log_probs variable, and it should be:

  • since we are predicting only one sentence in this assignment, the batch value should be 0 (the first and only batch)
  • since we are predicting the next token, this number should represent that (hint: it's the token_length - the length of the current outputs, without padding of course, and you get this value in the first line). In other words, we have already predicted the previous words and we want to know what the model thinks about the next one.
  • since we don't know which word is the most probable, we want all the probabilities here (you get them with the : slice)

This is how you get the variable log_probs - as per the instructions [0, 3, :], except the 3 should be changed to the appropriate variable.
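To make the indexing concrete, here is a tiny standalone sketch with made-up shapes and values (purely illustrative, not the assignment's real model output):

    import numpy as np

    # pretend model output with shape (batch size, decoder length, vocab size)
    batch_size, decoder_len, vocab_size = 1, 6, 10
    output = np.log(np.random.dirichlet(np.ones(vocab_size), size=(batch_size, decoder_len)))
    print(output.shape)     # (1, 6, 10)

    # suppose we have already produced 3 tokens (without padding)
    token_length = 3

    # log probabilities of every vocabulary word for the next position
    log_probs = output[0, token_length, :]
    print(log_probs.shape)  # (10,) - one log probability per vocabulary word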

Does this clarify things a bit, or do you still not quite understand the details?

That helps a lot. Let me use this example to express my understanding:
so say our sentence is "It's time for tea".
First, the batch index is 0. Then the code log_probs = output[0, token_length, :] would start with [0, 0, :], then [0, 1, :], then [0, 2, :]; the increment in the second dimension goes along with the growing number of words we have predicted: it is like we start with "It's", then "It's time", then "It's time for".

As for the colon (the : slice): say we have predicted "It's"; then we want to pick the word with the highest probability out of all the words in the vocabulary (I saw one vocabulary in Andrew's course that goes like a, Aaron, ..., Zulu) as the next word.

I think I get it, do I? Thank you very much for your help.

Hmm… I’m not sure. Let me elaborate:

If we are talking strictly about the next_symbol() function and log_probs, then that is not entirely true.

For example, if the input sentence for next_symbol() is “It’s time for tea”, then the output is (symbol, symbol log probability) or in other words - (next word, next word’s log probability). For example, it could be the word “now”.

In other words, the model outputs predictions for the previous tokens and the next token, the next-next token, the next-next-next token and so on, up to the sequence length. (We do not care about the model's predictions for "It's time for tea", but we do care about the next word prediction.)

But if we are talking about the details of what happens inside the function at this line:

    # get the model prediction
    output, _ = ...

then it's more of what you said - the model gets a tuple of inputs, one part being input_tokens and the other padded_with_batch. The part you were talking about is the part that goes to the input_encoder - which embeds the tokens, then LSTM layers encode the information for the decoder's consumption. In other words, it tries to express/compress the information that is "inside" these words and their sequence, so that the decoder can do its best to decompress what was inside the input and continue the sequence.
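Just to illustrate the idea, here is a rough sketch of such an encoder, assuming a Trax-style setup; the layer sizes and names are my own assumptions, not the assignment's exact input_encoder:

    import trax.layers as tl

    def make_input_encoder(vocab_size, d_model, n_encoder_layers):
        # embed the token ids into dense vectors, then let LSTM layers
        # encode the whole sequence for the decoder to consume
        return tl.Serial(
            tl.Embedding(vocab_size, d_model),
            [tl.LSTM(d_model) for _ in range(n_encoder_layers)],
        )

    input_encoder = make_input_encoder(vocab_size=500, d_model=32, n_encoder_layers=2)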

In general, when we train language models, what happens is what you talked about. For example, when we have a sentence like "It's time for tea", we train the model like this:

  • [<sos>] → what is the next word? → well, this time it’s [“It’s”]
  • [<sos>, “It’s”] → what is the next word? → well, this time it’s [“time”]
  • [<sos>, “It’s”, “time”] → what is the next word? → well, this time it’s [“for”]
  • [<sos>, “It’s”, “time”, “for”] → what is the next word? → well, this time it’s [“tea”]

and we update the weights accordingly.
We do this not only for efficiency (since we have the whole sentence, we should get the most out of it now and not reload it next time) but also so that the model can predict from a sentence of any length, even when its length is zero (see the small sketch below).
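A tiny sketch of how one sentence turns into these next-word training pairs (purely illustrative, with a made-up tokenization, not the assignment's code):

    # build (context, next word) training pairs from one sentence
    sentence = ["It's", "time", "for", "tea"]
    tokens = ["<sos>"] + sentence

    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[:i]   # what the model sees
        target = tokens[i]     # what it should predict next
        pairs.append((context, target))

    for context, target in pairs:
        print(context, "->", target)
    # ['<sos>'] -> It's
    # ['<sos>', "It's"] -> time
    # ['<sos>', "It's", 'time'] -> for
    # ['<sos>', "It's", 'time', 'for'] -> tea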

So to be clear, the model predicts every word, but the next_symbol function takes out only the one we care about.
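Putting it together, the core of that idea looks roughly like this - a simplified sketch assuming a NumPy-like output array; the real next_symbol also handles tokenization, padding and the model call:

    import numpy as np

    def next_symbol_sketch(output, token_length):
        """Pick the most probable next token from the model's output.

        output: array of shape (batch size, decoder length, vocab size)
        token_length: number of tokens produced so far (without padding)
        """
        log_probs = output[0, token_length, :]     # all vocab log probabilities for the next position
        symbol = int(np.argmax(log_probs))         # index of the most probable next token
        return symbol, float(log_probs[symbol])    # (next word id, its log probability)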


Sure. Thank you, mentor, you gave such a clear explanation. It is much clearer to me now.