Word embeddings are vector representations of words.
They help NLP models comprehend text by capturing the semantic and syntactic relations between words, which in turn improves model accuracy.
These are of two types: generated from a text corpus, or pre-trained. They are used across NLP tasks such as text classification (e.g. sentiment analysis), question answering, summary generation, etc.
The term “vector representation” means the encoding of words or phrases as numerical vectors. Each word is encoded as a dense vector, with each dimension representing a feature/aspect of the word, such as its meaning, context, or relationship with other words.
In short, this helps capture the correlation between words.
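To make this concrete, here is a minimal sketch of comparing word vectors with cosine similarity. The vectors below are made-up toy values, not real pre-trained embeddings; they just illustrate how "each dimension encodes a feature" lets us measure how related two words are.

```python
# Toy illustration (fabricated vectors, not real pre-trained embeddings):
# each word is a small dense vector whose dimensions loosely stand for
# features of the word.
import numpy as np

embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 means the vectors point the same
    # way (related words), close to 0.0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["king"], embeddings["queen"]))  # high similarity
print(cosine(embeddings["king"], embeddings["apple"]))  # low similarity
```

With real embeddings the vectors have hundreds of dimensions, but the comparison works exactly the same way.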
Regarding the choice of features, it depends on the needs of the project. Embeddings can be generated from a corpus, or you can use feature concatenation to add specific features of your choice.
So, it is not reglamentent, and the features and their values can differ from project to project, right? And the format of the features too? For example, in one project gender might be encoded as negative and positive numbers, and in another as numbers from 0 to 1?
I believe the word you are trying to use is “regimented”, right? If yes, then the answer is: I suppose not.
You can extract word embeddings from different text corpora using different techniques; for instance, you can create embeddings using Bag of Words, TF-IDF, Word2Vec, etc. Each technique, applied to a different text corpus, will give you a different set of word embeddings. It is then up to you whether you want to use the same set of word embeddings across applications, or train word embeddings from scratch for each of your applications.
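As a rough sketch of what "embeddings generated from a corpus" means, here is a minimal count-based version: each word gets a vector of how often it co-occurs with its neighbours in a tiny made-up corpus. Real techniques such as Word2Vec or GloVe learn dense vectors instead of raw counts, but the idea of deriving a representation from a corpus is the same.

```python
# Minimal count-based embedding sketch: a word's vector is the count of
# each vocabulary word appearing next to it (a +/-1 word window).
from collections import defaultdict

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

cooc = defaultdict(lambda: [0] * len(vocab))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):          # immediate left/right neighbours
            if 0 <= j < len(sent):
                cooc[w][index[sent[j]]] += 1

print(cooc["cat"])  # "cat"'s vector: counts of its neighbouring words
```

Note how "cat" and "dog" end up with similar vectors because they appear in similar contexts; that is the intuition dense embedding methods build on.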
What I believe could be one of the deciding factors is the dataset you are considering for your downstream application. If your dataset consists of samples of text from the web, then most likely the existing pre-trained word embeddings can be employed. But if your dataset includes samples of proprietary text (for instance, medical documents), then you could consider training word embeddings from scratch.
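For reference, using pre-trained embeddings often comes down to loading a plain text file in the common "word v1 v2 ... vn" layout (GloVe distributes its vectors this way). The tiny file below is fabricated so the example is self-contained; in practice you would point `path` at a real downloaded file.

```python
# Sketch of loading pre-trained embeddings from the common
# "word v1 v2 ... vn" text format. The file here is fabricated
# just to keep the example runnable.
import numpy as np

path = "toy_vectors.txt"
with open(path, "w") as f:
    f.write("doctor 0.1 0.9 0.3\n")
    f.write("nurse 0.2 0.8 0.4\n")

def load_vectors(path):
    vectors = {}
    with open(path) as f:
        for line in f:
            word, *nums = line.split()
            vectors[word] = np.array(nums, dtype=float)
    return vectors

vecs = load_vectors(path)
print(vecs["doctor"])  # 3-dimensional vector for "doctor"
```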
Another factor to consider when training word embeddings from scratch is the size of the corpus: generally, the larger the corpus, the richer your word embeddings will be in terms of representation.
Indeed, that is one of the interpretations. But another notable point is that a corpus from the web may still contain words from proprietary text, just not abundantly, which can lead to poorer representations of those words down the line.
Considering the example of medical documents again: those words could be present in abundance in your proprietary dataset, which could lead to richer word embeddings and, in turn, better downstream performance.
If you are asking this as a hypothetical question, then no one is stopping you from creating 1-dimensional word embeddings, but the question you need to ask is: “Will those word embeddings be useful?” So typically, word embeddings are vectors of many dimensions.
Let’s examine what “introducing new features” could mean. Say you have pre-trained word embeddings of 256 dimensions. If by “introducing new features” you mean extending the embeddings of certain words only, that would create a problem: the word embeddings would no longer have uniform length. Hence, this is not the way to go.
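If you do want to add a hand-picked feature, the uniform-length requirement means appending it to every word's vector, not just some. A small sketch, with made-up 2-dimensional vectors and a made-up "is_medical_term" feature:

```python
# Sketch: appending an extra hand-crafted feature to EVERY word's vector,
# so all embeddings keep the same (uniform) length. Vectors and the
# "is_medical" feature values are fabricated for illustration.
import numpy as np

embeddings = {
    "aspirin": np.array([0.3, 0.7]),
    "table":   np.array([0.6, 0.1]),
}
is_medical = {"aspirin": 1.0, "table": 0.0}

extended = {
    word: np.concatenate([vec, [is_medical[word]]])
    for word, vec in embeddings.items()
}

# Every vector is now 3-dimensional: uniform length is preserved.
print(extended["aspirin"].shape, extended["table"].shape)
```

This is the "feature concatenation" mentioned earlier: every word gets a value for the new dimension, even if that value is 0.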
One simple approach is to load the weights of the model that was used to extract the word embeddings, and fine-tune it on your dataset.
But note a small caveat with this approach: pre-trained word embeddings only cover the words in the vocabulary of their associated text corpora. So if your corpora contain many words that are missing from the pre-trained embeddings’ vocabulary, then, once again, training word embeddings from scratch is a good option.
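A common middle ground for that caveat is to build your model's embedding matrix by copying pre-trained vectors where available and randomly initialising rows for out-of-vocabulary words, which then get tuned during fine-tuning. A sketch with fabricated toy vectors:

```python
# Sketch: initialise an embedding matrix from pre-trained vectors where
# possible; out-of-vocabulary (OOV) words get small random vectors that
# fine-tuning can later adjust. All values here are toy/fabricated.
import numpy as np

dim = 4
pretrained = {
    "patient": np.array([0.1, 0.2, 0.3, 0.4]),
    "doctor":  np.array([0.4, 0.3, 0.2, 0.1]),
}
my_vocab = ["patient", "doctor", "angioplasty"]  # last word is OOV

rng = np.random.default_rng(0)
matrix = np.zeros((len(my_vocab), dim))
for i, word in enumerate(my_vocab):
    if word in pretrained:
        matrix[i] = pretrained[word]           # reuse pre-trained vector
    else:
        matrix[i] = rng.normal(0, 0.1, dim)    # random init, tuned later

print(matrix.shape)  # one row per word in my_vocab
```

If most of your vocabulary falls into the OOV branch, that is the signal that training from scratch is the better option.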
Omg, won’t it break my data if I add zeros as the extra features of words? So, do I understand correctly that if I want to add new features or a new word, I would have to manually label each feature for each word, and vice versa? Or do the features appear automatically from context? I understand rather poorly how the model will know which features, with which values, should be applied to each word. Will it be based on the features applied to words from the same context?
These are just some of the possible ways to improve your pre-trained word embeddings. To find out which one works best, implement them for your application and compare the results for yourself. Do share your results with the community.
In my two years of experience with AI, I haven’t seen a single case of someone manually creating word embeddings for all the words in a vocabulary. Would you like to undertake that endeavour?
As Tom and Arvydas pointed out, adding zeros is highly unlikely to work, since that is not how word embeddings are supposed to work. I have modified my answers accordingly; let us know whether your doubts are resolved.
Do I understand correctly that the features are themselves words of the word embedding too? Why do we have far fewer features than words in the dictionary in that case? And how is it trained? Only from context? Does it ever involve manual work?