C5 W2 Embedding words

Hello everyone,
In the slide below there are some ambiguities which I kindly ask you to clarify.

So Andrew suggests using a binary classification problem (b.c.p.) instead of using 10k units, each with a theta vector of, I guess, 300 parameters (since the embedding vector e_c has 300 dimensions), and then employing a softmax at each iteration, because that is computationally expensive.
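For reference, this is how I understand the softmax model from the Word2Vec lecture; the sum over all 10,000 \theta_j vectors is what makes it expensive:

$$
p(t \mid c) = \frac{e^{\theta_t^{\top} e_c}}{\sum_{j=1}^{10{,}000} e^{\theta_j^{\top} e_c}}
$$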

So in the b.c.p method, we choose k negative words and apply the sigmoid:

  1. But then I would expect a (k + 1)-dimensional y_hat prediction for our set of k negative words + 1 positive word. Why did Andrew mention 10k b.c.p. in the slide?
  2. Are we iterating over the same set of k words, or do we use a different set of k words at each iteration?
    a) if over the same set of k words, then are we not training the theta parameters of the other words?
    b) if over a different set of k words, then don't we get vanishing or exploding gradients at each iteration, since the theta parameters are only initialized once at the beginning?
  3. Let's say we go with the softmax method and train the parameters for 10,000 words. What can we say about the characteristics of those parameters? Are they the components of the E embedding matrix?

thanks in advance
cheers,
Vahdet

Hi @Vahdet_Vural

I think you are asking about the lecture “Negative Sampling” in C5 W2.

The k in k negative words means the number of negative words, whereas the k in 10k means thousand. There are 10,000 words in the vocabulary, and instead of learning about all of them, we pick (e.g.) k=4 of them as negative, plus the one positive, to learn about a total of 4+1 out of 10,000 words.
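In formula form (this is my paraphrase of the lecture, not the assignment's notation), each of those k + 1 words gets its own small sigmoid classifier:

$$
P(y = 1 \mid c, t) = \sigma(\theta_t^{\top} e_c)
$$

so on any single iteration only k + 1 of the 10,000 possible binary classifiers are actually trained.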

Different. As the lecture says, the k words are randomly chosen; we randomly choose a new set of k words each time.

I don’t understand why this will lead to vanishing or exploding gradients.

Note that we only initialize all parameters once and only once before the training starts. We do not re-initialize anything over and over again. Also, a word that is not chosen as one of the k negative samples may serve as (1) a negative sample for another instance of a positive sample, or (2) a positive sample itself in another instance.
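To make that concrete, here is a minimal sketch (my own toy code, not the course's; it assumes plain SGD and uniform negative sampling, whereas the lecture uses a frequency-based heuristic). The parameters are created once, and a fresh set of k negatives is drawn inside the loop:

```python
import numpy as np

vocab_size, emb_dim, k = 10_000, 300, 4
rng = np.random.default_rng(0)

# initialized once, before training starts
E = rng.normal(scale=0.01, size=(vocab_size, emb_dim))      # embeddings e_c
Theta = rng.normal(scale=0.01, size=(vocab_size, emb_dim))  # parameters theta_t

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(context_id, target_id, lr=0.01):
    """One (context, target) pair: 1 positive + k freshly sampled negative words."""
    negatives = rng.choice(vocab_size, size=k, replace=False)  # new negatives every step
    ids = np.append(negatives, target_id)
    labels = np.append(np.zeros(k), 1.0)

    e_c = E[context_id].copy()
    theta = Theta[ids]                        # (k + 1, emb_dim)
    preds = sigmoid(theta @ e_c)              # k + 1 sigmoid outputs
    grad = (preds - labels)[:, None]          # d(loss)/d(logit) for each pair

    # only these k + 2 parameter rows are touched on this step
    E[context_id] -= lr * (grad * theta).sum(axis=0)
    Theta[ids] -= lr * grad * e_c
```

Call train_step with different (context, target) pairs and different words are updated each time, so over many iterations essentially the whole vocabulary gets trained.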

They are components of the E embedding matrix, because any \theta_t or \theta_c is drawn from that embedding matrix. As you pointed out initially, it's a more computationally expensive way, but it does not change where we take the \theta's from. If we take them from the E embedding matrix, they remain components of the E embedding matrix.

Cheers,
Raymond

Thanks a lot. After watching the relevant lectures again, together with your explanations, it has become clearer.
It is now clear that at every iteration we sample a positive context-target pair and k negative pairs for each sentence or example in the training set.
From the videos one might think that we only train the parameters of one set of (k + 1) pairs, namely the orange-juice instance in the video.

This is an important detail and Andrew does not mention it. I thought the parameters \theta_t of the softmax unit and the word embeddings e_c are different, or at least do not have to be the same. Otherwise, in the softmax method we would always predict the context word itself, since its cosine similarity would be the highest, which would in turn make the softmax probability of the context word the highest.
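To illustrate what I mean with a toy sketch (random vectors only, and assuming \theta_t were simply set equal to the embedding e_t, which is not what the course does):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(10_000, 300))   # toy embedding matrix, 300-dim vectors
c = 123                              # some context word
e_c = E[c]

# if theta_t were just e_t, the scores would behave like cosine similarities,
# and a vector's cosine similarity with itself is always the maximum (= 1)
scores = (E @ e_c) / (np.linalg.norm(E, axis=1) * np.linalg.norm(e_c))
print(int(np.argmax(scores)))        # prints 123: the context word itself wins
```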

cheers,
vahdet

I see, vahdet. Thanks for sharing it with us.

Happy learning!
Raymond