Are there dense layers in word2vec?

The Probabilistic Language Model has dense layers preceding the softmax layer. Such dense layers are not shown in the picture that Andrew drew for word2vec:
[word2vec diagram from the lecture]

Is that on purpose, or are there really dense layers in word2vec as well?

Yes, there is an implied dense layer there; that's where the theta values come from.

In the Probabilistic Language Model, there is more than one. Why would it not be the case in Word2Vec?

Perhaps there is.
Yes, I’ll ask another mentor to handle your question.

Hey @Meir,

In the Word2Vec paper, they presented two model architectures: the CBOW model and the continuous skip-gram model. Both have a single hidden layer that aims to learn word representations.

In general, you are free to use more hidden layers. As far as I understand, they settled on a single layer for the sake of efficiency.
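
For concreteness, here is a rough NumPy sketch of what "a single hidden layer" means in both architectures. The matrices `W_in`/`W_out`, the toy sizes, and the initialization are my own illustration, not something taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10000, 300                             # vocabulary size, hidden (embedding) size

W_in = rng.normal(scale=0.01, size=(N, V))    # input -> hidden weights (the word vectors)
W_out = rng.normal(scale=0.01, size=(V, N))   # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())                   # shift by the max for numerical stability
    return e / e.sum()

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

# Skip-gram: the single hidden layer is the embedding of the center word.
def skipgram_forward(center_idx):
    h = W_in @ one_hot(center_idx)            # (300,) hidden layer, no non-linearity
    return softmax(W_out @ h)                 # (10000,) distribution over context words

# CBOW: the hidden layer is the average of the context words' embeddings.
def cbow_forward(context_idxs):
    h = np.mean([W_in @ one_hot(i) for i in context_idxs], axis=0)
    return softmax(W_out @ h)                 # (10000,) distribution over the center word

print(skipgram_forward(7).shape, cbow_forward([3, 5, 11, 13]).shape)
```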

Just to clarify the video example,

o_{c} \to E \to e_{c} \to \textrm{softmax} \to \hat{y}

is equivalent to

\hat{y} = \textrm{softmax}(Eo_{c} + b)

Where

  • o_{c} is the input word represented as a one-hot vector.
  • E is a trainable weight matrix.
  • Eo_{c} + b is a linear (or dense) layer that outputs a vector e_{c} of non-normalized probabilities.
  • \textrm{softmax}(e_{c}) is a layer that outputs a vector \hat{y} of normalized probabilities.
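
Read literally, that condensed formula is one dense layer followed by a softmax. Here is a minimal NumPy sketch of it, using a tiny vocabulary purely for illustration (and, as the exchange below clarifies, in the lecture the softmax step is really a layer with its own weights \theta_{t}):

```python
import numpy as np

V = 8                                   # tiny vocabulary, just for illustration
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, V))  # trainable weights of the dense layer
b = np.zeros(V)                         # bias

def softmax(z):
    e = np.exp(z - z.max())             # shift by the max for numerical stability
    return e / e.sum()

o_c = np.zeros(V)
o_c[3] = 1.0                            # one-hot vector for the input word

e_c = E @ o_c + b                       # dense layer: non-normalized scores
y_hat = softmax(e_c)                    # normalized probabilities, sums to 1
print(y_hat.round(3), y_hat.sum())
```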

Hi @manifest, this is very helpful. Can you clarify some more?

  • o_{c} is a one-hot vector the length of the dictionary.
  • E is a weight matrix that outputs a feature vector e_{c} of arbitrary length (300 in the lecture).
  • Eo_{c} + b outputs a vector of non-normalized probabilities the same length as the dictionary, one element per word.

How does the 300-element feature vector become the 10,000-element vector of probabilities? Are the two trainable weight matrices different from each other?

I feel that this has something to do with \theta_{t}, but I haven’t quite grasped it.

Hey @CharmingQuark,

You are right. There are actually two models 🙂
I guess in the lecture, they just didn’t want to complicate things.

Given a vocabulary of size 10000, with m being the batch size and T_{x} the sequence length, we have the following shapes for the input one-hot vector o_{c} and the weight matrix E:

  • o_{c} \in \mathbb{R}^{m \times T_{x} \times 10000}

  • E \in \mathbb{R}^{300 \times T_{x}}

The product of these matrices, e_{c} = Eo_{c}, will have the following shape:

  • e_{c} \in \mathbb{R}^{m \times 300 \times 10000}

In the lecture, \textrm{softmax} is not just a function but a layer, meaning that it includes a linear (or dense) part with trainable weights \theta_{t}, which has the following shape:

  • \theta_{t} \in \mathbb{R}^{1 \times 300}

Multiplying these, \theta_{t}e_{c}, and then applying the softmax gives us the 10000-element vector of probabilities \hat{y}:

  • \hat{y} \in \mathbb{R}^{m \times 1 \times 10000}
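
For what it’s worth, here is a quick NumPy check of those shapes, picking m = 2 and T_{x} = 5 arbitrarily and relying on matmul broadcasting over the batch dimension (this only verifies the shapes, not the training itself):

```python
import numpy as np

m, T_x, V, N = 2, 5, 10000, 300   # batch size, sequence size, vocabulary, embedding size

o_c = np.zeros((m, T_x, V))       # batch of one-hot vectors
E = np.zeros((N, T_x))
theta_t = np.zeros((1, N))

e_c = np.matmul(E, o_c)           # E is broadcast across the batch dimension
y_hat = np.matmul(theta_t, e_c)

print(e_c.shape)                  # (2, 300, 10000)
print(y_hat.shape)                # (2, 1, 10000)
```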

So, there are actually two models stacked together:

  • e_{c} = Eo_{c} + b_{1}, which is basically the model we call word2vec.

  • \hat{y} = \textrm{softmax}(\theta_{t}e_{c} + b_{2}), a language model that we use to train the weights of the word2vec model.
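
Putting it together, here is a minimal NumPy sketch of the two pieces stacked as one forward pass. The sizes, variable names, and initialization are my own; in a framework this would simply be an embedding layer followed by a dense softmax layer:

```python
import numpy as np

V, N = 10000, 300                             # vocabulary size, embedding size
rng = np.random.default_rng(0)

# Piece 1: the embedding (word2vec) part.
E = rng.normal(scale=0.01, size=(N, V))
b1 = np.zeros(N)

# Piece 2: the language-model head used to train E.
Theta = rng.normal(scale=0.01, size=(V, N))   # one row theta_t per target word t
b2 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())                   # shift by the max for numerical stability
    return e / e.sum()

def forward(word_idx):
    o_c = np.zeros(V)
    o_c[word_idx] = 1.0
    e_c = E @ o_c + b1                        # 300-dimensional embedding of the word
    return softmax(Theta @ e_c + b2)          # 10000 probabilities over target words

print(forward(42).shape)                      # (10000,)
```

After training, only E (the embeddings) is kept; the softmax head exists only to provide a training signal.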
