DLS5 W2 Learning Word Embeddings

dds · January 31, 2023, 8:19pm

Just looking for a big picture of how the embedding matrix is computed. Maybe I missed something in the week 2 lectures. The lectures did cover various underlying concepts (word representation, featurized representation as word embeddings, how to use word embeddings/transfer learning, word embedding matrix E, Skip-Gram model).

Somehow I am missing how the entries of matrix E are computed. I can see the supervised learning training task, but that task mostly seems to estimate the parameters of the last layer (softmax or logistic classifiers). It is not clear to me how the training takes care of matrix E.

Thanks.

Juan_Olano · January 31, 2023, 11:13pm

Hi @dds ,

This is my ‘big picture’ of how the embedding matrix is computed:

1. The neural network used to calculate E is actually a very shallow one: it has one hidden layer and an output layer, and the output layer is a softmax.
2. The model is fed a very large amount of text (a corpus). The way you feed these words was explained in lecture, specifically in the Skip-Gram lecture. Another possible way is called CBOW (Continuous Bag of Words). Let's assume that this is clear.
3. And here’s the trick: while the model is being trained, the W matrix of the single hidden layer is continuously updated by backpropagation. At the end of training, the W matrix of the hidden layer is our embedding matrix E!
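The three points above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the lecture's or the paper's implementation: the toy corpus, the window of 1 word, the 5 features, the learning rate, and the names `E` and `Theta` are all assumptions made for the example, and it uses a plain full softmax with Skip-Gram style (center, context) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus (assumption: real training uses a very large corpus).
corpus = "the king rules the land the queen rules the land".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
V, n_features = len(vocab), 5

# Skip-Gram style training pairs: (center word id, context word id), window of 1.
pairs = []
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            pairs.append((word_to_id[w], word_to_id[corpus[j]]))

E = rng.normal(scale=0.1, size=(n_features, V))      # hidden-layer weights
Theta = rng.normal(scale=0.1, size=(V, n_features))  # softmax-layer weights


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def mean_loss():
    # Average cross-entropy over all training pairs.
    return float(np.mean([-np.log(softmax(Theta @ E[:, c])[t]) for c, t in pairs]))


E_initial = E.copy()
loss_before = mean_loss()

lr = 0.1
for epoch in range(200):
    for c, t in pairs:
        e_c = E[:, c]                  # same as E @ one_hot(c)
        p = softmax(Theta @ e_c)       # predicted distribution over the vocabulary
        grad_logits = p.copy()
        grad_logits[t] -= 1.0          # gradient of cross-entropy w.r.t. the logits
        grad_e = Theta.T @ grad_logits
        grad_Theta = np.outer(grad_logits, e_c)
        E[:, c] -= lr * grad_e         # backprop reaches the hidden layer too
        Theta -= lr * grad_Theta

loss_after = mean_loss()
print(loss_before, loss_after)
print(E.shape)  # (n_features, vocab_size): one column per word
```

The point to notice is that the gradient does not stop at the softmax weights: each step also updates the column of `E` belonging to the input word, so `E` is learned as a side effect of the prediction task.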

And that’s it!

This is, again, the ‘big picture’ on how Embedding matrix is computed.

Let me know if you’d like to double click on any of these points.

Thanks!

Juan

Juan_Olano · January 31, 2023, 11:49pm

I wanted to add something else to item #3 above:

One may ask: how is it that the W matrix of the hidden layer learns values that represent the semantics of the target words?

In training, we are feeding the model with context words and a target word (or with a target word and negative samples, depending on the chosen method). For this explanation, let's go with the model of context words + target word, where the context words can be, for example, the 4 words before and the 4 words after the target word.

So, if different target words have similar context words, then we would expect that the output of the model would be similar in these cases. And again, if the outputs are similar, then the corresponding W values (calculated in the backprop) would also be similar for these target words.
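The premise of this argument, that some target words share the same context words, can be checked directly on a toy corpus. The sketch below only counts neighbours; it does not train anything, and the corpus, the window of 1 word (instead of 4), and the helper names are assumptions made for illustration.

```python
from collections import Counter
import math

corpus = "the king rules the land the queen rules the land".split()
window = 1  # assumption: 1 word on each side, to keep the toy example small


def context_counts(target):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for i, w in enumerate(corpus):
        if w == target:
            for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                if j != i:
                    counts[corpus[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


king, queen, land = (context_counts(w) for w in ("king", "queen", "land"))
sim_king_queen = cosine(king, queen)
sim_king_land = cosine(king, land)
print(sim_king_queen, sim_king_land)
```

Here "king" and "queen" have exactly the same neighbours, so their context vectors coincide, while "king" and "land" overlap only partly. During training, words with matching contexts are pushed toward producing the same predictions, which is what pulls their embeddings together.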

Another important detail: the hidden layer is a fully-connected layer, and the number of neurons is usually set between 100 and 300 (sometimes more, sometimes less). Each neuron would represent a ‘feature’. So the W matrix of the hidden layer would be of shape (n_features, vocab_size). For 10,000 words and 300 features, this W would have shape (300, 10,000).
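To make the shapes concrete, here is a small sketch showing that multiplying W by a one-hot vector just picks out one column, which is that word's embedding. The random values and the word index 1234 are placeholders, not trained weights.

```python
import numpy as np

vocab_size, n_features = 10_000, 300
rng = np.random.default_rng(0)

# Placeholder for the trained hidden-layer weights.
W = rng.normal(size=(n_features, vocab_size))

word_index = 1234  # hypothetical position of some word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

embedding_via_matmul = W @ one_hot       # what the hidden layer computes
embedding_via_lookup = W[:, word_index]  # the equivalent column lookup

print(W.shape)                     # (300, 10000)
print(embedding_via_matmul.shape)  # (300,)
```

In practice, frameworks use the column lookup rather than the matrix product, since multiplying by a mostly-zero vector is wasteful.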

dds · February 1, 2023, 4:08am

@Juan_Olano Thank you so much. I think I get the outline of the process.

Thanks for pointing out CBOW. I must have missed it if it was covered in the lectures. In fact, from the original paper (Mikolov et al.), it seems CBOW outputs a word given context words as input (Skip-Gram goes the other way). So it seems that for analogies or sentence completion CBOW would be the answer. (Fig. 1 of https://arxiv.org/pdf/1301.3781.pdf)

I will read some more and google some more. I will reach out if there are questions. My goal is just to have a high-level picture of what goes on.

Juan_Olano · February 1, 2023, 7:19pm

Thanks for the feedback.

CBOW is very briefly mentioned by Prof. Ng in the Word2Vec lecture, and I think it is worth understanding it as well, along with the Skip-Gram technique.

As you say, CBOW outputs a word given a context while Skip-Gram outputs a context given a word. But at the end of the day, what I think is the central point of the discussion is ‘where is the embedding matrix generated’, and this is, as shared, the W matrix of the hidden layer.
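The difference between the two methods shows up in how training examples are built from the same text. The sketch below is illustrative only: the function names, the example sentence, and the window of 1 word are assumptions, and real implementations add details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window):
    """Skip-Gram: the center word is the input, each context word is an output."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """CBOW: the surrounding context words are the input, the center word is the output."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        examples.append((context, center))
    return examples


tokens = "I want a glass of orange juice".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print(sg[:3])  # (input word, output word) pairs
print(cb[:2])  # (context word list, output word) examples
```

Either way, the input side goes through the same hidden layer, so in both cases the embeddings come from that layer's weight matrix.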

Google will provide a wealth of links that discuss Word2Vec - if you find any new insight, please share it!

Thanks,

Juan

Akshay_Parundekar · September 7, 2023, 6:25am

As Juan_Olano pointed out, W can be a matrix of shape [No. of features X Vocab size], where each feature would represent what a single neuron has learned. If I understand correctly, feature attributes like ‘fruit’, ‘alive’, ‘royal’ are only indicative, and in practice we don't need to define them. The model learns these attributes for every word as a column vector of W.
Please correct me if my understanding is wrong.

Juan_Olano · September 7, 2023, 12:55pm

Your understanding is correct. We don’t have to define these attributes. Instead, they are learned by the model while you train a word embedding.

Namhoang · September 12, 2023, 12:04pm

Hello, I also have one question: in the computation of the softmax p(t|c) in Word2Vec, what is the dimension of 𝜃_t?
I saw it in the quiz but I can’t find it mentioned in the Word2Vec video.
