To get the last token from the log probabilities

arvyzukai · April 21, 2023, 2:25pm

Hmm… I’m not sure. Let me elaborate:

If we are talking strictly about the next_symbol() function and log_probs then that is not entirely true.

For example, if the input sentence for next_symbol() is “It’s time for tea”, then the output is (symbol, symbol log probability), or in other words (next word, next word’s log probability). The next word could be, say, “now”.

The model itself outputs predictions for every position: the previous tokens, the next token, the next-next token, and so on up to the sequence length. We do not care about the model’s predictions for “It’s time for tea”; we only care about its prediction for the next word.

But if we are talking about the details of what happens inside the function in this line:

    # get the model prediction
    output, _ = ...

then it’s closer to what you said: the model gets a tuple of inputs, one part being input_tokens and the other padded_with_batch. Then the part you were talking about:

is the part that goes to input_encoder, which embeds the tokens; the LSTM layers then encode the information for the decoder to consume. In other words, the encoder tries to express/compress the information that is “inside” these words and their sequence so that the decoder can do its best to decompress what was in the input and continue the sequence.

In general, when we train language models, the thing that happens is what you talked about. For example, when we have a sentence like “It’s time for tea”, we train the model like:

[<sos>] → what is the next word? → well, this time it’s [“It’s”]
[<sos>, “It’s”] → what is the next word? → well, this time it’s [“time”]
[<sos>, “It’s”, “time”] → what is the next word? → well, this time it’s [“for”]
[<sos>, “It’s”, “time”, “for”] → what is the next word? → well, this time it’s [“tea”]

and we update the weights accordingly.
We do this not only for efficiency (since we already have the whole sentence, we should get the most out of it now rather than reload it next time) but also so that the model can predict from a sequence of any length, even when its length is zero.
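To make the training picture above concrete, here is a minimal sketch that builds the (prefix, next word) pairs for one sentence. The helper name next_word_pairs and the plain-string tokens are my own assumptions for illustration; in a real model the tokens would be integer ids and all positions would be trained in a single pass rather than one pair at a time.

```python
def next_word_pairs(tokens, sos="<sos>"):
    """Return (prefix, target) pairs for next-word prediction.

    Hypothetical helper, for illustration only: each prefix starts with
    the start-of-sentence marker, and the target is the word that follows.
    """
    sequence = [sos] + list(tokens)
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


pairs = next_word_pairs(["It's", "time", "for", "tea"])
for prefix, target in pairs:
    print(prefix, "->", target)
```

Each printed line corresponds to one of the "what is the next word?" steps in the list above.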

So to be clear, the model predicts every word, but the next_symbol function takes out only the one we care about.
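As a rough sketch of that last step (not the assignment’s code), suppose the model output is an array of log probabilities of shape (batch, sequence_length, vocab_size). I am assuming here that the decoder input is shifted right by one, so that the row at index token_length holds the prediction for the next token; the function name pick_next_symbol is hypothetical. The assignment samples the symbol with a temperature, while this sketch simply takes the most likely one (greedy argmax), and the model output is faked with random numbers.

```python
import numpy as np


def pick_next_symbol(output, token_length):
    """Greedy pick of the next symbol from per-position log probabilities.

    output: array of shape (batch, sequence_length, vocab_size)
    token_length: number of tokens generated so far (assumed to be the
        index of the row that predicts the next token)
    """
    # Keep only the row for the next token; all other rows are ignored.
    log_probs = output[0, token_length, :]
    symbol = int(np.argmax(log_probs))
    return symbol, float(log_probs[symbol])


# Fake model output: random logits turned into log probabilities.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 8, 5))
output = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))

symbol, log_prob = pick_next_symbol(output, token_length=4)
print(symbol, log_prob)
```

The model still produced a prediction for all 8 positions; the function just discards everything except the one row we care about.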
