In the "Attention Is All You Need" paper, the authors mention the "learned embedding" used in the Transformer. Did they use CBOW, GloVe, or some other method, or did they use nn.Embedding and train it jointly with the rest of the model, updating it from the training loss?
I also found a paper where they use cosine similarity to generate embeddings.
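To make the question concrete, this is what I mean by the nn.Embedding option: the embedding table is just another weight matrix updated by the same loss as everything else. A toy sketch of my understanding (not code from the paper; the class and sizes here are made up):

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, num_classes=2):
        super().__init__()
        # The embedding table starts random and is updated by the same
        # optimizer step as every other weight in the model.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):               # (batch, seq_len)
        x = self.embed(token_ids)               # (batch, seq_len, d_model)
        return self.head(x.mean(dim=1))         # (batch, num_classes)

model = TinyClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 10000, (8, 20))       # fake batch of token ids
labels = torch.randint(0, 2, (8,))
loss = loss_fn(model(tokens), labels)
loss.backward()                                  # gradients flow into embed.weight
optimizer.step()                                 # so the embedding is "learned"
```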
The subject of Word Embeddings and how to train them is covered in some detail in DLS Course 5 (Sequence Models) Week 2. Prof Ng shows various techniques for training word embeddings and discusses several of the most successful ones. I don’t know which one was used in any of the specific papers or the various recent LLMs (GPT-n for various values of n).
Sorry, I don’t know what a “3b Decoder” is, but the point with Attention Models and Word Embeddings is that training the Word Embedding model is a completely separate step that is done before you start putting together your Attention Model. The Word Embedding model is a tool that you use by calling it at various stages in your Attention Model code. As I mentioned, there is good material explaining how to build and train Word Embedding models in DLS C5 W2, and the last two weeks of DLS cover Attention Models and how they are built using Word Embeddings. But the point is that you don’t need to do the training of the Word Embeddings yourself: it’s just a question of picking the best pre-existing trained model for your particular purposes. Prof Ng also discusses those choices in DLS C5 W2: he describes some of the popular embedding models.
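As a rough sketch of what I mean by treating the Word Embedding as a pre-trained tool you call from your own code (assuming PyTorch and a GloVe-style vector text file; the filename and dimensions are just placeholders):

```python
import numpy as np
import torch
import torch.nn as nn

# Suppose you have downloaded a GloVe-style text file where each line is
# "word v1 v2 ... v50". The filename and dimension here are placeholders.
word_to_index = {}
vectors = []
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        parts = line.rstrip().split(" ")
        word_to_index[parts[0]] = i
        vectors.append(np.asarray(parts[1:], dtype=np.float32))

# Wrap the pretrained vectors as a frozen lookup table.
embedding = nn.Embedding.from_pretrained(
    torch.tensor(np.stack(vectors)), freeze=True
)

# "Calling the tool": map words to indices, then indices to vectors.
ids = torch.tensor([word_to_index["king"], word_to_index["queen"]])
vecs = embedding(ids)        # shape (2, 50); no training happens here
```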
I have gone through that course, but it doesn’t match the paper I have read.
How would you address new tokens in that case? If we pretrain an embedding on context A, how does it relate to context B without further pretraining? Most decoder models are trained in a context-aware way, and the whole reason to train on 135 B is to incorporate as many contexts as possible. If a pretrained embedding were enough, I guess that wouldn’t be necessary? (There is a toy sketch of what I mean by context-aware below.)
Also, most of the available embeddings are built with encoder-only models.
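What I mean by context-aware: a static embedding gives a token the same vector in every sentence, but after even one self-attention layer its representation depends on the surrounding tokens. A toy sketch (my own illustration, not from any paper; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(100, 16)                      # static lookup table
attn = nn.MultiheadAttention(embed_dim=16, num_heads=1, batch_first=True)

# Two "sentences" that share token 5 in position 0 but differ elsewhere.
context_a = torch.tensor([[5, 7, 9]])
context_b = torch.tensor([[5, 2, 3]])

xa, xb = embed(context_a), embed(context_b)
print(torch.allclose(xa[0, 0], xb[0, 0]))          # True: the static embedding ignores context

ya, _ = attn(xa, xa, xa)
yb, _ = attn(xb, xb, xb)
print(torch.allclose(ya[0, 0], yb[0, 0]))          # False: the representation now depends on context
```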
Well, you need to pick a general-purpose Word Embedding model that has been trained on a large corpus in the language you are dealing with (English, French, German …). Google and other companies have trained such models using very large, general-purpose input datasets (the plural of “corpus” turns out to be “corpora”, which sounds just as odd as “corpuses” somehow). Then you do further training of your Attention model, with its Encoders and Decoders, using training data that reflects the actual task you are trying to perform.
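Something like this, as a sketch of that division of labor (the sizes and layers are arbitrary, and the “pretrained” matrix is random just so it runs): the embedding comes from the general-purpose model and stays frozen, while the attention layers are trained on your task data.

```python
import torch
import torch.nn as nn

# Stand-in for vectors loaded from a general-purpose embedding model;
# random here just so the sketch runs.
pretrained_matrix = torch.randn(10000, 50)

embed = nn.Embedding.from_pretrained(pretrained_matrix, freeze=True)
encoder = nn.TransformerEncoderLayer(d_model=50, nhead=5, batch_first=True)
head = nn.Linear(50, 2)

# Only the encoder and head are trained on your task-specific data;
# the embedding weights stay fixed because freeze=True disables their gradients.
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 10000, (4, 12))   # stand-in for a task batch
labels = torch.randint(0, 2, (4,))
logits = head(encoder(embed(tokens)).mean(dim=1))
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```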
Encoders and Decoders, in the sense those terms are used in Attention, do something more complex than a Word Embedding does, but the outputs of the Word Embedding are used as inputs at various points. If what you mean by that statement is that a Word Embedding is essentially a “one way” function that maps from a vocabulary to the embedding vectors, then yes, I agree. You see examples in DLS C5 where they implement the other direction, but it ends up being an exhaustive “search and compare”, so it’s pretty costly. E.g. in the “Debiasing” assignment there was one example where they did a search to find the word whose embedding was closest to the compound embedding they had computed. You could interpret that as an example of the “Decoder” mode of a Word Embedding.
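That “search and compare” direction is just a brute-force nearest-neighbour scan over the whole vocabulary, which is why it is costly. A minimal sketch (toy vocabulary and random vectors, not the assignment’s code):

```python
import torch
import torch.nn.functional as F

def nearest_word(query_vec, embedding_matrix, index_to_word):
    """Exhaustive search: compare the query against every row of the table."""
    sims = F.cosine_similarity(query_vec.unsqueeze(0), embedding_matrix, dim=1)
    return index_to_word[int(sims.argmax())]

# Toy vocabulary and random vectors, just to show the mechanics.
index_to_word = ["king", "queen", "man", "woman"]
embedding_matrix = torch.randn(4, 50)

# e.g. the classic analogy-style compound vector
query = embedding_matrix[0] - embedding_matrix[2] + embedding_matrix[3]
print(nearest_word(query, embedding_matrix, index_to_word))
```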