Some confusion about the Word2Vec model


There are some concepts in this slide that I am not able to get clear. So first I will state the concept as I understood it (let me know if it is wrong), and then I will list the questions in my mind. Please help me clarify them.

We have an input (i.e. the context word) and try to predict the target word. For this we build a one-hot vector for the context word according to the vocabulary and, using the embedding matrix, we look up the embedding vector of the context word (i.e. EC). We initialize the softmax parameter theta randomly, pass EC through the softmax function, and it predicts the probabilities of the words in the vocabulary.
After that it calculates the loss, the loss is backpropagated through the softmax function, and the parameter theta is optimized.
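My understanding above, written as a toy sketch in plain Python (this is just my own illustration; the tiny vocabulary, the matrix names `E` and `theta`, and the random values are all made up by me):

```python
import math
import random

random.seed(0)

vocab = ["orange", "juice", "glass", "of"]
V, D = len(vocab), 3  # vocabulary size, embedding dimension

# Embedding matrix E (V x D) and softmax parameters theta (V x D),
# both randomly initialized before training.
E = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
theta = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]

def forward(context_word):
    c = vocab.index(context_word)  # position of the 1 in the one-hot vector
    e_c = E[c]                     # "one-hot times E" is just row c of E
    # One logit per vocabulary word t: logit_t = theta_t . e_c
    logits = [sum(t_j * e_j for t_j, e_j in zip(theta[t], e_c)) for t in range(V)]
    m = max(logits)                # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [x / s for x in exps]   # softmax: probabilities over the vocabulary

probs = forward("orange")          # one probability per word, summing to 1
```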

Now my questions are -

  1. What is theta actually (what is its significance)? Is it the embedding vector of the target word, or something else?
  2. Are the values of the embedding matrix randomly initialized, and do they reach specific values during training? So after training the model we get our embedding matrix containing the word embeddings (is that TRUE?).
    If YES, then how is it updated?

Hi @SurajKP79

There are some misconceptions in your post, and your English is a bit hard for me to follow, so let me address this point first:

That is not true. Softmax is just an operation that maps a vector to values between 0 and 1.
In other words, the raw outputs (values) of the network are all over the place - negative, positive, big numbers, small numbers - and if you want to interpret them as probabilities (each between 0 and 1, with their sum equal to 1), then you apply softmax.
So theta is actually not a parameter (of the model), but the outputs of the model.
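To make that concrete, here is a minimal softmax in plain Python (my own sketch, not code from the course):

```python
import math

def softmax(z):
    m = max(z)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# Raw network outputs ("logits") can be any real numbers...
logits = [2.0, -1.0, 0.5]
probs = softmax(logits)
# ...after softmax each value is in (0, 1) and they sum to 1,
# so they can be read as probabilities. The largest logit stays the largest.
```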

Yes, the embedding matrix values are randomly initialized at the start of training. These values are then constantly updated according to how well the model predicts the targets: values that contributed to lowering the probability of the correct word are reduced, and values that contributed to increasing the probability of the correct word are increased.
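As a rough sketch of how one such update works (my own illustration with made-up numbers, assuming the standard cross-entropy-plus-softmax gradients): both theta and the embedding row of the context word receive a gradient, and after the step the model assigns a higher probability to the correct word.

```python
import math

# Toy values (made up): context embedding e_c and softmax parameters
# theta for a 3-word vocabulary; word 1 is the correct target word.
e_c = [0.1, -0.2]
theta = [[0.3, 0.1], [-0.2, 0.4], [0.05, -0.1]]
target = 1
lr = 0.5  # learning rate

def probs(theta, e_c):
    logits = [sum(a * b for a, b in zip(row, e_c)) for row in theta]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [x / s for x in exps]

p_before = probs(theta, e_c)

# Cross-entropy + softmax gradients:
#   d loss / d theta_t = (p_t - y_t) * e_c     (y_t is 1 for the target, else 0)
#   d loss / d e_c     = sum_t (p_t - y_t) * theta_t
err = [p_before[t] - (1.0 if t == target else 0.0) for t in range(3)]
grad_e = [sum(err[t] * theta[t][j] for t in range(3)) for j in range(2)]
theta = [[theta[t][j] - lr * err[t] * e_c[j] for j in range(2)] for t in range(3)]
e_c = [e_c[j] - lr * grad_e[j] for j in range(2)]

p_after = probs(theta, e_c)
# After one step the probability of the correct word has gone up.
assert p_after[target] > p_before[target]
```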