Are there dense layers in word2vec?

The Probabilistic Language Model has dense layers preceding the softmax layer. Such dense layers are not shown in the picture that Andrew drew for word2vec:
[word2vec diagram from the lecture]

Is that on purpose, or are there really dense layers in word2vec as well?

Yes, there is an implied dense layer there; that's where the theta values come from.

In the Probabilistic Language Model, there is more than one. Why would it not be the case in Word2Vec?

Perhaps there is.
Yes, I’ll ask another mentor to handle your question.

Hey @Meir,

In the Word2Vec paper, they presented two model architectures: the CBOW model and the continuous skip-gram model. Both have a single hidden layer that aims to learn word representations.

In general, you are free to use more hidden layers. As far as I understand, they settled on a single layer for the sake of efficiency.
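
For concreteness, here is a rough NumPy sketch of what "a single hidden layer" means in both architectures. The matrices `W_in`/`W_out`, the toy sizes, and the initialization are my own illustration, not something taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10000, 300                             # vocabulary size, hidden (embedding) size

W_in = rng.normal(scale=0.01, size=(N, V))    # input -> hidden weights (the word vectors)
W_out = rng.normal(scale=0.01, size=(V, N))   # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())                   # shift by the max for numerical stability
    return e / e.sum()

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

# Skip-gram: the single hidden layer is the embedding of the center word.
def skipgram_forward(center_idx):
    h = W_in @ one_hot(center_idx)            # (300,) hidden layer, no non-linearity
    return softmax(W_out @ h)                 # (10000,) distribution over context words

# CBOW: the hidden layer is the average of the context words' embeddings.
def cbow_forward(context_idxs):
    h = np.mean([W_in @ one_hot(i) for i in context_idxs], axis=0)
    return softmax(W_out @ h)                 # (10000,) distribution over the center word

print(skipgram_forward(7).shape, cbow_forward([3, 5, 11, 13]).shape)
```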

Just to clarify the video example,

o_{c} \to E \to e_{c} \to \textrm{softmax} \to \hat{y}

is equivalent to

\hat{y} = \textrm{softmax}(Eo_{c} + b)

Where

  • o_{c} is the input word represented as a one-hot vector.
  • E is a trainable weight matrix.
  • Eo_{c} + b is a linear (or dense) layer that outputs a vector e_{c} of non-normalized probabilities.
  • \textrm{softmax}(e_{c}) is a layer that outputs a vector \hat{y} of normalized probabilities.
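
Read literally, that condensed formula is one dense layer followed by a softmax. Here is a minimal NumPy sketch of it, using a tiny vocabulary purely for illustration (and, as the exchange below clarifies, in the lecture the softmax step is really a layer with its own weights \theta_{t}):

```python
import numpy as np

V = 8                                   # tiny vocabulary, just for illustration
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, V))  # trainable weights of the dense layer
b = np.zeros(V)                         # bias

def softmax(z):
    e = np.exp(z - z.max())             # shift by the max for numerical stability
    return e / e.sum()

o_c = np.zeros(V)
o_c[3] = 1.0                            # one-hot vector for the input word

e_c = E @ o_c + b                       # dense layer: non-normalized scores
y_hat = softmax(e_c)                    # normalized probabilities, sums to 1
print(y_hat.round(3), y_hat.sum())
```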

Hi @manifest, this is very helpful. Can you clarify some more?

  • o_{c} is a one-hot vector the length of the dictionary.
  • E is a weight matrix that outputs a feature vector e_{c} of arbitrary length (300 in the lecture).
  • Eo_{c} + b outputs a vector of non-normalized probabilities the same length as the dictionary, one element per word.

How does the 300-element feature vector become the 10,000-element vector of probabilities? Are the two trainable weight matrices different from each other?

I feel that this has something to do with \theta_{t}, but I haven’t quite grasped it.

Hey @CharmingQuark,

You are right. There are actually two models 🙂
I guess in the lecture, they just didn’t want to complicate things.

Given a vocabulary of size 10000, with m being the batch size and T_{x} the sequence length, we have the following shapes for the input one-hot vector o_{c} and the weight matrix E:

  • o_{c} \in \mathbb{R}^{m \times T_{x} \times 10000}

  • E \in \mathbb{R}^{300 \times T_{x}}

The product of these matrices, e_{c} = Eo_{c}, will have the following shape:

  • e_{c} \in \mathbb{R}^{m \times 300 \times 10000}

In the lecture, \textrm{softmax} is not just a function but a layer, meaning that it includes a linear (or dense) part with trainable weights \theta_{t}, which has the following shape:

  • \theta_{t} \in \mathbb{R}^{1 \times 300}

Multiplying these, \theta_{t}e_{c}, and then applying the softmax gives us the 10000-element vector of probabilities \hat{y}:

  • \hat{y} \in \mathbb{R}^{m \times 1 \times 10000}
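
For what it’s worth, here is a quick NumPy check of those shapes, picking m = 2 and T_{x} = 5 arbitrarily and relying on matmul broadcasting over the batch dimension (this only verifies the shapes, not the training itself):

```python
import numpy as np

m, T_x, V, N = 2, 5, 10000, 300   # batch size, sequence size, vocabulary, embedding size

o_c = np.zeros((m, T_x, V))       # batch of one-hot vectors
E = np.zeros((N, T_x))
theta_t = np.zeros((1, N))

e_c = np.matmul(E, o_c)           # E is broadcast across the batch dimension
y_hat = np.matmul(theta_t, e_c)

print(e_c.shape)                  # (2, 300, 10000)
print(y_hat.shape)                # (2, 1, 10000)
```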

So, there are actually two models stacked together:

  • e_{c} = Eo_{c} + b_{1}, which is basically the model we call word2vec.

  • \hat{y} = \textrm{softmax}(\theta_{t}e_{c} + b_{2}), a language model that we use to train the weights of the word2vec model.
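
Putting it together, here is a minimal NumPy sketch of the two pieces stacked as one forward pass. The sizes, variable names, and initialization are my own; in a framework this would simply be an embedding layer followed by a dense softmax layer:

```python
import numpy as np

V, N = 10000, 300                             # vocabulary size, embedding size
rng = np.random.default_rng(0)

# Piece 1: the embedding (word2vec) part.
E = rng.normal(scale=0.01, size=(N, V))
b1 = np.zeros(N)

# Piece 2: the language-model head used to train E.
Theta = rng.normal(scale=0.01, size=(V, N))   # one row theta_t per target word t
b2 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())                   # shift by the max for numerical stability
    return e / e.sum()

def forward(word_idx):
    o_c = np.zeros(V)
    o_c[word_idx] = 1.0
    e_c = E @ o_c + b1                        # 300-dimensional embedding of the word
    return softmax(Theta @ e_c + b2)          # 10000 probabilities over target words

print(forward(42).shape)                      # (10000,)
```

After training, only E (the embeddings) is kept; the softmax head exists only to provide a training signal.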
