In the "Attention Is All You Need" paper, the authors mention the "learned embedding" used in the Transformer. Did they use CBOW, GloVe, or some other method, or did they use nn.Embedding and train it jointly with the rest of the model, updating it from the training loss?
I also found a paper where they use cosine similarity to generate embeddings.
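To make the question concrete, this is what I mean by the nn.Embedding option: the embedding table is just another weight matrix updated by the same loss as everything else. A toy sketch of my understanding (not code from the paper; the class and sizes here are made up):

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, num_classes=2):
        super().__init__()
        # The embedding table starts random and is updated by the same
        # optimizer step as every other weight in the model.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):               # (batch, seq_len)
        x = self.embed(token_ids)               # (batch, seq_len, d_model)
        return self.head(x.mean(dim=1))         # (batch, num_classes)

model = TinyClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 10000, (8, 20))       # fake batch of token ids
labels = torch.randint(0, 2, (8,))
loss = loss_fn(model(tokens), labels)
loss.backward()                                  # gradients flow into embed.weight
optimizer.step()                                 # so the embedding is "learned"
```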
The subject of Word Embeddings and how to train them is covered in some detail in DLS Course 5 (Sequence Models) Week 2. Prof Ng shows various techniques for training word embeddings and discusses several of the most successful ones. I don’t know which one was used in any of the specific papers or the various recent LLMs (GPT-n for various values of n).
Sorry, I don’t know what a “3b Decoder” is, but the point with Attention Models and Word Embeddings is that training the Word Embedding model is a completely separate step that is done before you start putting together your Attention Model. The Word Embedding model is a tool that you use by calling it at various stages in your Attention Model code. As I mentioned, there is good material explaining how to build and train Word Embedding models in DLS C5 W2, and the last two weeks of DLS cover Attention Models and how they are built using Word Embeddings. But the point is that you don’t need to do the training of the Word Embeddings yourself: it’s just a question of picking the best pre-existing trained model for your particular purposes. Prof Ng also discusses those choices in DLS C5 W2: he describes some of the popular embedding models.
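As a rough sketch of what I mean by treating the Word Embedding as a pre-trained tool you call from your own code (assuming PyTorch and a GloVe-style vector text file; the filename and dimensions are just placeholders):

```python
import numpy as np
import torch
import torch.nn as nn

# Suppose you have downloaded a GloVe-style text file where each line is
# "word v1 v2 ... v50". The filename and dimension here are placeholders.
word_to_index = {}
vectors = []
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        parts = line.rstrip().split(" ")
        word_to_index[parts[0]] = i
        vectors.append(np.asarray(parts[1:], dtype=np.float32))

# Wrap the pretrained vectors as a frozen lookup table.
embedding = nn.Embedding.from_pretrained(
    torch.tensor(np.stack(vectors)), freeze=True
)

# "Calling the tool": map words to indices, then indices to vectors.
ids = torch.tensor([word_to_index["king"], word_to_index["queen"]])
vecs = embedding(ids)        # shape (2, 50); no training happens here
```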
I have gone through that course, but it doesn’t match the paper I have read.
How would you address new tokens in that case? If we pretrain an embedding on context A, how does it relate to context B without further pretraining? Most decoder models are trained in a context-aware way, and the whole reason to train on 135 B is to incorporate as many contexts as possible. If a pretrained embedding were enough, I guess that wouldn’t be necessary? (There is a toy sketch of what I mean by context-aware below.)
Also, most of the available embeddings are built with encoder-only models.
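What I mean by context-aware: a static embedding gives a token the same vector in every sentence, but after even one self-attention layer its representation depends on the surrounding tokens. A toy sketch (my own illustration, not from any paper; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(100, 16)                      # static lookup table
attn = nn.MultiheadAttention(embed_dim=16, num_heads=1, batch_first=True)

# Two "sentences" that share token 5 in position 0 but differ elsewhere.
context_a = torch.tensor([[5, 7, 9]])
context_b = torch.tensor([[5, 2, 3]])

xa, xb = embed(context_a), embed(context_b)
print(torch.allclose(xa[0, 0], xb[0, 0]))          # True: the static embedding ignores context

ya, _ = attn(xa, xa, xa)
yb, _ = attn(xb, xb, xb)
print(torch.allclose(ya[0, 0], yb[0, 0]))          # False: the representation now depends on context
```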
Well, you need to pick a general-purpose Word Embedding model that has been trained on a large corpus in the language you are dealing with (English, French, German …). Google and other companies have trained such models using very large, general-purpose input datasets (the plural of “corpus” turns out to be “corpora”, which sounds just as odd as “corpuses” somehow). Then you do further training of your Attention model, with its Encoders and Decoders, using training data that reflects the actual task you are trying to perform.
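Something like this, as a sketch of that division of labor (the sizes and layers are arbitrary, and the “pretrained” matrix is random just so it runs): the embedding comes from the general-purpose model and stays frozen, while the attention layers are trained on your task data.

```python
import torch
import torch.nn as nn

# Stand-in for vectors loaded from a general-purpose embedding model;
# random here just so the sketch runs.
pretrained_matrix = torch.randn(10000, 50)

embed = nn.Embedding.from_pretrained(pretrained_matrix, freeze=True)
encoder = nn.TransformerEncoderLayer(d_model=50, nhead=5, batch_first=True)
head = nn.Linear(50, 2)

# Only the encoder and head are trained on your task-specific data;
# the embedding weights stay fixed because freeze=True disables their gradients.
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 10000, (4, 12))   # stand-in for a task batch
labels = torch.randint(0, 2, (4,))
logits = head(encoder(embed(tokens)).mean(dim=1))
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```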
Encoders and Decoders, in the sense those terms are used in Attention, do something more complex than a Word Embedding does, but the outputs of the Word Embedding are used as inputs at various points. If what you mean by that statement is that a Word Embedding is essentially a “one way” function that maps from a vocabulary to the embedding vectors, then yes, I agree. You see examples in DLS C5 where they implement the other direction, but it ends up being an exhaustive “search and compare”, so it’s pretty costly. E.g. in the “Debiasing” assignment there was one example where they did a search to find the word whose embedding was closest to the compound embedding they had computed. You could interpret that as an example of the “Decoder” mode of a Word Embedding.
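That “search and compare” direction is just a brute-force nearest-neighbour scan over the whole vocabulary, which is why it is costly. A minimal sketch (toy vocabulary and random vectors, not the assignment’s code):

```python
import torch
import torch.nn.functional as F

def nearest_word(query_vec, embedding_matrix, index_to_word):
    """Exhaustive search: compare the query against every row of the table."""
    sims = F.cosine_similarity(query_vec.unsqueeze(0), embedding_matrix, dim=1)
    return index_to_word[int(sims.argmax())]

# Toy vocabulary and random vectors, just to show the mechanics.
index_to_word = ["king", "queen", "man", "woman"]
embedding_matrix = torch.randn(4, 50)

# e.g. the classic analogy-style compound vector
query = embedding_matrix[0] - embedding_matrix[2] + embedding_matrix[3]
print(nearest_word(query, embedding_matrix, index_to_word))
```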