Sorry, I do not understand your statement, so I cannot answer.
As you can see on the screen, we have the main vocab, the words “Man, Woman, King, Queen, Apple, Orange”, and the words “Gender, Royal, Age, Food” that try to explain each of the vocab words; those are the features, right? The vocab can be 40k words, but there might be only 1,000 features, for example. And these features (words) are extracted from the vocab as the words that are most useful for describing all the words in the vocab, aren’t they?
Sorry, I don’t know what you mean by “features” in this context.
The same as earlier, for example in the initial topic body.
In this topic I am trying to understand what these features are, and what word embeddings are too. So I just shared my guess about what the features are.
Sorry, I’d better wait for a mentor for this course to reply. I don’t want to give you a misleading or confused answer.
No, they are not.
They (1,000 of them for each of the 40k words in your example) are just float numbers that best fit the training data (as measured by the loss function).
In other words, the training process tries to change this (40_000 x 1_000) embedding weight matrix (and the other layers’ matrices) so that it fits the data as well as possible (by minimizing the loss function).
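For example, in Keras the embedding layer is literally just such a trainable weight matrix. A minimal sketch, assuming TensorFlow/Keras and the 40k x 1,000 sizes from the example:

```python
import tensorflow as tf

vocab_size = 40_000    # words in the vocabulary
embedding_dim = 1_000  # "features" per word

# The embedding layer owns one trainable (vocab_size x embedding_dim) matrix of floats.
embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
_ = embedding(tf.constant([[1, 2, 3]]))   # call it once so the weights get created

print(embedding.get_weights()[0].shape)   # (40000, 1000)

# During training, gradient descent nudges these floats (together with the other
# layers' weights) to minimize the loss function.
```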
A similar example from Course 3 that might help - here the embedding dimension is 50 and the values are not from the vocabulary or anywhere else - they were initially created randomly and then updated accordingly - lowered or increased depending on whether the prediction matched the target.
In your picture, this matrix is sideways (meaning the features are 4 - Gender, Royal, Age and Food; and the vocab size is 6 - Man, Woman, King, Queen, Apple and Orange) - in other words, the features are usually the columns. And in your picture the 4 features are just for illustration purposes - in reality they are not that interpretable - instead they would be 0, 1, 2, 3 (and not any word from the vocabulary or any word at all).
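Here is a tiny NumPy sketch of that picture, assuming the 6-word vocab and 4 features: the rows are word indices, the columns are just feature indices 0, 1, 2, 3 with no names attached, and the values start out random:

```python
import numpy as np

vocab = ["man", "woman", "king", "queen", "apple", "orange"]
word_to_index = {word: i for i, word in enumerate(vocab)}

n_features = 4  # the features are just columns 0, 1, 2, 3 - not words, not interpretable

# Randomly initialized (6 x 4) embedding matrix; training would then adjust these floats.
rng = np.random.default_rng(42)
embedding_matrix = rng.normal(size=(len(vocab), n_features))

# Looking up a word's embedding is just selecting its row.
print(embedding_matrix[word_to_index["king"]])  # 4 arbitrary floats, not "Gender/Royal/Age/Food"
```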
Cheers
ML algorithms need numbers to learn patterns from the data. Text data consists of words and sentences, which need to be converted into some sort of numerical form before patterns can be found in them. Vector representations are simply numerical representations of the words in a sentence.
- One Hot Encoding:- Assign each word from the unique set of words present in the corpus its own index, and represent it as a vector that is 1 at that index and 0 everywhere else.
- Word Embeddings:- A numerical representation that takes the semantic meaning of the words and their associations within the corpus into consideration. This is helpful in understanding the context of the words in a sentence, as the same word may carry different meanings when used in different sentences.
In a nutshell, word vectors are numerical representations of words.
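To make the contrast between the two representations concrete, here is a minimal sketch with a made-up four-word vocabulary (the numbers are illustrative only):

```python
import numpy as np

vocab = ["king", "queen", "apple", "orange"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# One-hot encoding: each word is a sparse vector with a single 1 at its own index.
one_hot = np.eye(len(vocab))
print(one_hot[word_to_index["king"]])      # [1. 0. 0. 0.] - says nothing about meaning

# Word embedding: each word is a dense vector of floats (random here, learned in practice).
embedding_dim = 3
embeddings = np.random.default_rng(0).normal(size=(len(vocab), embedding_dim))
print(embeddings[word_to_index["king"]])   # 3 floats whose geometry can capture word similarity
```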
Regards, Shrikrishna.
Question: Can pre-trained word embeddings be augmented? The Sequence Models course doesn’t quite say if pre-trained word embeddings (like GloVe) can be modified through transfer learning. This is an important subject, since there’s a huge need for specialized vocabulary for specific problem domains (in medicine and various scientific/engineering fields).
If you have, say, only 5,000 sentence examples, can a specialized word embedding be effectively developed? This might seem more like a RAG and LLM question, but this topic is more “under the hood” than a prompt-engineering question. Thanks!
Yes, and they are often fine-tuned (along with other layers) for better results (whatever that might mean for a certain project).
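For instance, in Keras that usually looks something like the sketch below: the pre-trained vectors (e.g. GloVe) initialize the Embedding layer and then keep training along with the rest of the model. The GloVe file path, the tiny word_to_index mapping, and the commented-out training data names are only illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

def load_glove_matrix(path, word_to_index, embedding_dim):
    """Build a (vocab_size x embedding_dim) matrix from a GloVe text file.

    Words missing from the GloVe file keep a random initialization."""
    matrix = np.random.normal(size=(len(word_to_index), embedding_dim)).astype("float32")
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in word_to_index:
                matrix[word_to_index[word]] = np.asarray(values, dtype="float32")
    return matrix

# word_to_index would come from your tokenizer; shown here as a tiny stand-in.
word_to_index = {"the": 0, "patient": 1, "dose": 2, "<unk>": 3}
embedding_dim = 100
embedding_matrix = load_glove_matrix("glove.6B.100d.txt", word_to_index, embedding_dim)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        len(word_to_index),
        embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=True,  # keep training the pre-trained vectors on your own sentences
    ),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(padded_sequences, labels)  # fine-tunes the embeddings along with the other layers
```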
One important aspect of this “specialized vocabulary” is tokenization. In short, certain models use certain tokens, and you cannot simply change the tokens. You can fine-tune the embeddings of the current tokens, but adding specialized tokens would require model retraining, or at least a bigger fine-tuning phase (5,000 sentence examples would probably not yield any improvement over the original tokenizer).
It very much depends on the model size, the goal, and other aspects (for example, a moderate-size classification model would produce better predictions after being fine-tuned on a dataset of that size, while a full chatbot would probably do better with RAG).
I would probably classify the original question under a “fine-tuning” category.