So, what are word embeddings?

As I remember from the Deep Learning course, embeddings were explained as something where each value labels some aspect of the word's meaning (a feature). Like the feature "gender": -1 for the word "woman" and 1 for the word "man".

But in the NLP course I took, I heard that a word embedding is only something that is generated from context.

So, in the first case, I understood that we can create word embeddings ourselves or take pretrained ones. In the second case, it is clear that this process is automatic and universal for all words.

So, what are word embeddings in the end? All I clearly understand now is that it is something that contains the sense of a word, keyed by that word as in a dict.

Hello @someone555777 ,

Word embeddings are vector representations of words.
They help models comprehend the meaning of text by capturing the semantic and syntactic relations between words, which helps NLP models perform better with greater accuracy.

They are of 2 types:

  1. Generated from text corpus
  2. Manually labeled

Applications:

text classification (e.g. sentiment analysis), question answering, summary generation, etc.

With regards,
Nilosree Sengupta


So, what exactly are vector representations? Are they standardized at all? Or do I choose the features myself, use external ones, and maybe add my own to them?

Hello @someone555777 ,

The term “vector representation” means the encoding of words or phrases as numerical vectors. Each word is encoded as a dense vector, with each dimension representing a feature/aspect of the word, such as its meaning, context, or relationship with other words.

In short, this helps to understand the correlation between words.
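
To make "vector representation" concrete, here is a minimal sketch with made-up numbers (hypothetical, loosely following the gender/royalty illustration from the course; real embeddings are learned, dense, and typically 50-300+ dimensional, and their dimensions are not individually interpretable like this):

```python
import numpy as np

# Hypothetical 4-dimensional vectors; the dimensions loosely read as
# [gender, royalty, age, food] only for illustration.
vectors = {
    "man":   np.array([-1.00, 0.01, 0.03, 0.09]),
    "woman": np.array([ 1.00, 0.02, 0.02, 0.01]),
    "king":  np.array([-0.95, 0.93, 0.70, 0.02]),
    "queen": np.array([ 0.97, 0.95, 0.69, 0.01]),
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman lands very close to queen.
analogy = vectors["king"] - vectors["man"] + vectors["woman"]
print(cosine(analogy, vectors["queen"]))  # ~0.99 with these toy numbers
```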

Regarding the choice of features, it depends on the needs of the project. Embeddings can be generated from a corpus, and you can also go for feature concatenation to add specific features of your choice.

Hope this helps.

With regards,
Nilosree Sengupta

So, it is not standardized, and from project to project there can be different features and different values for them, right? And the format of the features too, right? For example, in one project gender is defined as negative and positive numbers, and in another project as numbers from 0 to 1?

Hey @someone555777,
I believe what you are asking is whether word embeddings are "standardized", right? If yes, then the answer is, I suppose, not.

You can extract word embeddings from different text corpora, using different techniques; for instance, you can create embeddings using Bag of Words, TF-IDF, Word2Vec, etc. Each technique, used on a different text corpus, will provide you with a different set of word embeddings. Now, it's up to you whether you want to use the same set of word embeddings across applications, or train word embeddings from scratch for each of your applications.
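
For instance, one common way to train embeddings from a corpus is gensim's Word2Vec. A rough sketch (the tiny toy corpus here is only to show the API; real corpora are much larger, and the parameter values are just placeholders):

```python
from gensim.models import Word2Vec  # gensim >= 4.0

corpus = [
    ["the", "patient", "received", "a", "high", "dose"],
    ["the", "doctor", "reviewed", "the", "patient", "chart"],
    ["the", "dose", "was", "adjusted", "by", "the", "doctor"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of each word vector
    window=3,         # context window of surrounding words to learn from
    min_count=1,      # keep even rare words in this tiny example
    epochs=50,
)

vec = model.wv["doctor"]                       # a 50-dimensional numpy array
print(model.wv.most_similar("doctor", topn=3)) # words with the most similar vectors
```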

What I believe could be one of the factors to help you decide is the dataset that you are considering for your downstream application. If your dataset consists of samples of text from the web, then most likely the existing pre-trained word embeddings can be employed. But if your dataset includes samples from proprietary text (for instance, medical documents), then you could consider training word embeddings from scratch.

Another factor you need to consider when training word embeddings from scratch is the size of the corpus. Generally, the larger the corpus, the richer your word embeddings will be in terms of representation.

Let us know if this helps you out.

Cheers,
Elemento

Do you mean specific words that might not be in publicly available pretrained word embeddings?

Indeed, that is one of the interpretations. But another notable thing here is that a corpus from the web may still contain words from proprietary text, just not abundantly, which may lead to poorer representations of those words down the line.

Considering the example of medical documents again, these words could be present in abundance in your proprietary dataset, which could lead to richer word embeddings and, in turn, to an increase in performance.

Cheers,
Elemento

OK, I understand, thanks. So, in most cases word embedding vectors are like a simple np.array, right? Like [1, 2, 3], or even just a single number?

OK, so I can use the same embedding, but extend the description of some words by introducing new features, for example, right?

Hey @someone555777,

If you are asking this as a hypothetical question, then no one is stopping you from creating 1-unit word embeddings, but the question you need to ask is, “Will those word-embeddings be useful?”. So typically, word embeddings are vectors.

Let’s examine what “introducing new features” could mean. Let’s say that you have pre-trained word embeddings of 256 dimensions. Now, if by “introducing new features” you mean extending the word embeddings of certain words only, then this would create an issue, since the word embeddings would be of non-uniform length. Hence, this is not the way out.

One simple way could be to load the weights of the model that was used to extract the word embeddings, and fine-tune it on your dataset.
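
A rough sketch of that idea in Keras (the vocabulary size, dimensions, data, and classifier head below are placeholders, not a recommendation of a specific model): load the pretrained vectors into an Embedding layer and keep it trainable, so the vectors get adjusted on your own dataset.

```python
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 10_000, 256   # placeholder sizes
# Random numbers stand in for a real pretrained embedding matrix.
pretrained = np.random.rand(vocab_size, embed_dim).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim=vocab_size,
        output_dim=embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained),
        trainable=True,   # fine-tune the pretrained vectors on your data
    ),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. a binary classifier head
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(token_ids, labels, ...)  # token_ids: integer-encoded text from your corpus
```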

But note a small caveat in this approach. The pre-trained word embeddings will only contain word embeddings for the words in the vocabulary of the text corpora they were trained on. So, if you have a lot of words in your corpora that are missing from the pre-trained embeddings' vocabulary, then, once again, training word embeddings from scratch is a good option.

Cheers,
Elemento

OMG, won't it break my data if I add zeros for the new features of other words? So, do I understand correctly that if I want to add new features or a new word, it would be good to manually label each feature for each word, and vice versa? Or does it appear automatically from context? I understand rather poorly how the model will figure out which features, and with what values, should be applied to each word. Will it be by the features that were applied to words from the same context?

Hey @someone555777,

These are just some of the possible ways in which you can improve your pre-trained word embeddings. To know which one works the best, you can implement these for your application, and compare them for yourself. Do share your results with the community.

In my 2 years of experience with AI, I haven't seen a single example in which someone has manually created word embeddings for all the words in the vocabulary. Would you like to take on that endeavour? :thinking:

Cheers,
Elemento


Hello @someone555777 ,

I hope your doubts have now been cleared up by @Elemento.

With regards,
Nilosree Sengupta

No. That’s not at all how embeddings work. It’s not like adding features to a data set used in regression.

An embedding is a big matrix that tells you how the words in a vocabulary are related. You can’t just add new features to it, unless you relate them to the underlying vocabulary.

For example, you could change the vocabulary to add new words, and then train it from scratch on a set of text that uses the new words you have added.
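
Put differently (a minimal numpy sketch, with a made-up vocabulary and random numbers standing in for trained values): the embedding is one matrix with a row per vocabulary word, and a word's vector is just its row, looked up by index. Adding a new word means adding a row and training it, not bolting extra features onto a few existing rows.

```python
import numpy as np

vocab = ["the", "man", "woman", "king", "queen"]   # made-up vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

embedding_dim = 8
# One row per word; random values stand in for trained ones.
embedding_matrix = np.random.rand(len(vocab), embedding_dim)

def embed(word):
    """Look up a word's vector: just a row of the matrix."""
    return embedding_matrix[word_to_index[word]]

print(embed("queen").shape)   # (8,)

# Adding a new word means adding a row (to be trained), not new columns:
vocab.append("prince")
word_to_index["prince"] = len(vocab) - 1
embedding_matrix = np.vstack([embedding_matrix, np.random.rand(1, embedding_dim)])
```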


Hey @someone555777,
As Tom and Arvydas pointed out, adding zeros is highly unlikely to work, since it is not the way word embeddings are supposed to work. I have modified my answers accordingly; let us know whether your doubts are resolved.

Cheers,
Elemento


Do I understand correctly that the features are themselves words of the word embedding too? Why do we have many fewer features than words in the dictionary in that case? And how is it trained? Only from context in that case? Does it ever involve manual work?

Sorry, but I do not understand what you mean.

Can we characterize each feature of a word embedding as a human word? Like Andrew said

In that case, all words can be connected with all other words, can't they?

In that case I would just slightly extend your answer: “An embedding is a big matrix that tells you how the words in a vocabulary are related to the words that were used in training.”

So, do I understand correctly that features are conceptually roughly the same as words that were used during training to explain each word? Like “Man, Woman, King, Queen, Apple, Orange” in the screenshot.

Maybe this topic is also somewhat related to understanding what word embeddings are.