Questions about Transformer Models

Hi there!

I have just watched the video “Generating Text with Transformers” and I have a few questions about different stages of the process.

Let’s suppose the input is “What colour is the sky at sunset?”

I understand this sentence will be tokenized. Each token is mapped to a vector, which is then added to its positional encoding and passed to the attention heads, each head being responsible for processing the vectors according to its own specialty.

And here I’m unclear about what happens. In the previous video it’s said that “The output of this layer is a vector of vector logits proportional to the probability score for each and every token in the tokenizer dictionary.” Does it mean each head will attach a score to each vector, so that a single vector can hold hundreds of scores? Also, are the heads somehow adding metadata or something similar that instructs the model to translate the input into Spanish, rather than answering “orange”, for example?

Then, the video says:

At this point, the data that leaves the encoder is a deep representation of the structure and meaning of the input sequence.

What is this data like? Is it a collection of vectors, or a single, multidimensional vector?

Now, about the decoder:

This representation is inserted into the middle of the decoder to influence the decoder’s self-attention mechanisms.

What do you mean by that? Is it used to filter which heads will be used in decoding?

Another question about the decoder:

I can understand the decoder selecting the most probable next token in the case of a question-answer process, as it’s generating a brand new answer. But in the case of a translation, will it make a sort of token-by-token mapping, picking the most probable equivalent in the target language? Or is it more like “hey, here’s the context provided by the encoder so you have an idea of what we’re talking about. Now go and follow your gut to create an equivalent in Spanish”?
(sorry to use such non-technical phrasing)

Another question: how does the model know when to stop? Is it when the most probable next token is a period, question mark, etc.?

Thank you!

I’ll try to answer the questions.

I am not entirely sure what you mean. Each head does compute a score for each input word, and the head outputs are then concatenated along the latent dimension. Note that these are attention scores between the input tokens; the logits over the tokenizer vocabulary are produced later, by the final linear layer of the model. The input tensor has shape (batch, sequence_length, latent_dimension). With n heads, the Q, K and V projections in each head reduce this to (batch, sequence_length, latent_dimension / n), and after concatenation you get (batch, sequence_length, latent_dimension) back. Does this answer your question?
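To make the shapes concrete, here is a minimal sketch in plain Python. It only illustrates the shape bookkeeping: the helper names (`split_heads`, `concat_heads`, `shape`) are made up for this example, and slicing the last dimension stands in for the learned Q, K, V projections a real model would use.

```python
def split_heads(x, n_heads):
    """Split (batch, seq_len, d_model) nested lists into n_heads tensors
    of shape (batch, seq_len, d_model // n_heads)."""
    d_model = len(x[0][0])
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads
    return [
        [[token[h * d_head:(h + 1) * d_head] for token in seq] for seq in x]
        for h in range(n_heads)
    ]


def concat_heads(heads):
    """Concatenate per-head tensors along the last (latent) dimension."""
    batch = len(heads[0])
    seq_len = len(heads[0][0])
    return [
        [sum((head[b][t] for head in heads), []) for t in range(seq_len)]
        for b in range(batch)
    ]


def shape(x):
    """Return (batch, seq_len, width) of a 3-level nested list."""
    return (len(x), len(x[0]), len(x[0][0]))


# Toy input: batch=2, sequence_length=3, latent_dimension=8
x = [[[float(b * 100 + t * 10 + i) for i in range(8)]
      for t in range(3)] for b in range(2)]

heads = split_heads(x, n_heads=4)   # 4 tensors, one per head
merged = concat_heads(heads)        # back to the original width

print(shape(x))         # (2, 3, 8)
print(shape(heads[0]))  # (2, 3, 2)
print(shape(merged))    # (2, 3, 8)
```

In a real layer each head would run attention on its slice before the concatenation, but the shapes going in and out are the same as here.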

No, the model learns this during training. When you are translating, the output is the translation. The model basically learns to predict the next word from the previous text (including the prompt), so if the prompt says to translate into Spanish, it will learn to do that.

It is used to modify the weights given to each of the input words.
In attention, the model uses the tokens in 3 different ways, called queries, keys and values, each with its own purpose. In the encoder, the input to attention is the input tokens, whereas in the decoder, the outputs are used. In the decoder, attention is applied twice: first as self-attention over those outputs alone, and then over both the outputs and the output of the encoder (this second step is usually called cross-attention or encoder-decoder attention; the queries come from the decoder, and the keys and values come from the encoder output).
Notice that at inference it works the same way; the outputs are the tokens generated up to that point.
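The difference between the two attention steps in the decoder can be sketched as follows. This is a simplified illustration, not a full implementation: it leaves out the learned query/key/value projections, the multiple heads, and the causal mask that decoder self-attention normally uses, and the toy vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors.
    Returns (outputs, weights), with one row per query."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per key: how relevant is that position to this query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(wj * v[i] for wj, v in zip(w, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy vectors: 3 encoder positions, 2 decoder positions, width 2.
encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[0.5, 0.5], [2.0, 0.0]]

# Decoder self-attention: Q, K and V all come from the decoder side.
self_out, self_w = attention(decoder_states, decoder_states, decoder_states)

# Cross-attention: Q from the decoder, K and V from the encoder output.
cross_out, cross_w = attention(decoder_states, encoder_out, encoder_out)

print(len(self_w[0]))                   # 2
print(len(cross_out), len(cross_w[0]))  # 2 3
```

Each decoder position ends up with one weight per encoder position, which is how the encoder output influences what the decoder produces.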

Your second guess is probably closer. The idea is the same as when answering: the model produces one word at a time, using the input and the output generated up to the current step. The first step uses the input text and an initial special token ([BOS], I think it was), the second step uses the initial token plus the previously generated word, and so on. So it uses all the information at hand and produces a translation of the whole sentence rather than a word-by-word mapping.

A special token is used for that; I think it was [EOS], for "end of sequence". So, when the next word is [EOS], the model stops producing new words.
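The generation loop from the last two answers might look roughly like this. It is only a sketch: `TOY_SCORES` and `next_token_scores` are hypothetical stand-ins for a trained model, and the toy only looks at the last generated token, whereas a real decoder conditions on the full input and on everything generated so far.

```python
BOS, EOS = "[BOS]", "[EOS]"

# Hypothetical stand-in for a trained model: maps the last generated token
# to scores for the candidate next tokens.
TOY_SCORES = {
    BOS: {"El": 0.9, "cielo": 0.1},
    "El": {"cielo": 0.8, EOS: 0.2},
    "cielo": {"es": 0.7, EOS: 0.3},
    "es": {"naranja": 0.6, EOS: 0.4},
    "naranja": {EOS: 0.95, "es": 0.05},
}


def next_token_scores(source_tokens, generated):
    # source_tokens is ignored by this toy; a real decoder would attend to it.
    return TOY_SCORES[generated[-1]]


def greedy_decode(source_tokens, max_len=20):
    """Generate one token at a time until [EOS] or max_len is reached."""
    generated = [BOS]
    while len(generated) < max_len:
        scores = next_token_scores(source_tokens, generated)
        token = max(scores, key=scores.get)  # pick the most probable token
        if token == EOS:
            break                            # the model signals it is done
        generated.append(token)
    return generated[1:]                     # drop the [BOS] marker


print(greedy_decode(["What", "colour", "is", "the", "sky", "?"]))
# ['El', 'cielo', 'es', 'naranja']
```

The `max_len` cap is a common safety net in case [EOS] is never produced.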

I am not sure whether I have confused you more with that, but I hope it helps.

If you’re interested in learning more, the Sequence Models/Transformers videos by Andrew on Coursera could be worth a watch.

There are a lot of videos and explanations about Transformers out there, but (as usual) Andrew does a pretty good job of breaking it down into simpler terms.

And one more note I think might be helpful for your understanding: the decoder really only generates one token/word at a time, by outputting the probabilities for each of the possible tokens/words. Picking the single most probable token at every step does not always give the most probable sentence overall, so in many cases an algorithm like beam search is used to find a more likely sentence. Andrew explains that a little in the Deep Learning Specialization as well.
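As a rough illustration of why beam search can help, here is a small sketch. The probability table `TOY_PROBS` is invented for the example and stands in for a real model's per-step output; with a beam width of 1 the function behaves like greedy decoding.

```python
import math

EOS = "[EOS]"

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_PROBS = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
    ("a", "x"): {EOS: 1.0},
    ("a", "y"): {EOS: 1.0},
    ("b", "z"): {EOS: 1.0},
    ("b", "w"): {EOS: 1.0},
}


def beam_search(beam_width, max_len=5):
    """Keep the beam_width best partial sequences at every step."""
    beams = [((), 0.0)]  # (tokens so far, log-probability)
    finished = []
    for _ in range(max_len):
        # Extend every surviving sequence by every possible next token.
        candidates = []
        for tokens, logp in beams:
            for tok, p in TOY_PROBS[tokens].items():
                candidates.append((tokens + (tok,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((tokens[:-1], logp))  # complete sentence
            else:
                beams.append((tokens, logp))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[1])
    return list(best[0]), math.exp(best[1])


print(beam_search(beam_width=1)[0])  # ['a', 'x']
print(beam_search(beam_width=2)[0])  # ['b', 'z']
```

Greedy decoding commits to "a" because it is the best first token (0.6), but the best full sequence starts with "b" (0.4 × 0.9 = 0.36 versus 0.6 × 0.5 = 0.30), which the wider beam finds.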