When we learn word embedding models like word2vec or GloVe, our primary goal is to find word embeddings such that two similar words have similar embeddings. But in a skip-gram model (X → Y) we use the embedding vector (e) as the input, even though e is exactly what still needs to be derived. Could someone please explain how the embedding matrix E is learned and how the embedding vector is derived from it?
Let’s take a look at skip-gram with a simple network below.
import tensorflow as tf

vocab_size = 3000
embedding_size = 100

model = tf.keras.Sequential([
    tf.keras.Input(shape=(vocab_size,), name='context_word'),
    tf.keras.layers.Dense(embedding_size, use_bias=False, name='embedding'),
    tf.keras.layers.Dense(vocab_size, activation='softmax', name='target_word')
], name='skip-gram')
model.summary()
This is a simple version of the skip-gram model. Just as Andrew said in the lecture, the input is a one-hot vector of vocabulary size (3000 in this case), the output of the hidden layer (the embedding layer) is an embedding vector of embedding size (100 in this case), the parameter count of the embedding layer is the size of the embedding matrix E (3000 x 100 in this case), and the output layer produces the target word (its parameters are theta).
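To make the original question concrete: after training, E is just the weight kernel of the 'embedding' Dense layer, and because the input is one-hot, multiplying by E simply selects one row of it. Here is a small sketch of reading an embedding out of the toy model above (the word index is made up for illustration):

# The kernel of the 'embedding' layer is the embedding matrix E.
E = model.get_layer('embedding').get_weights()[0]   # shape (3000, 100)

# A one-hot input times E just picks out one row of E,
# so the embedding of word i is E[i].
word_index = 42                    # hypothetical vocabulary index
embedding_vector = E[word_index]   # shape (100,)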
Hopefully, it’s helpful.
Shouldn’t the size of the embedding matrix be 100 x 3000?
I had a similar question after watching Sequence Models > Week 2 > Learning Word Embeddings multiple times. It wasn’t immediately obvious to me that the embeddings ARE the parameters being trained (or at least some of them). Instead, from the way the process was described mechanically, it seemed more like the embedding matrix was an input, known a priori, and the actual parameters being trained were those of the neural net and softmax layers.
Pedagogically, I think it would have been helpful to see a quick “turn of the crank” showing what gets updated in each forward-prop/backward-prop cycle (without going deep into the derivatives, etc.), plus an explicit mention that once the model is trained, only the embedding matrix matters; the dense layer and output layer parameters are no longer needed.
(Of course, if I’ve gotten any of that wrong, I’m more than happy to be corrected!)
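For what it’s worth, here is a minimal sketch of that “turn of the crank” using the toy Keras model from the reply above. The compile settings and the single (context, target) pair are made up for illustration; the point is just that one forward/backward pass updates only the row of E selected by the one-hot context word:

import numpy as np
import tensorflow as tf

model.compile(optimizer='adam', loss='categorical_crossentropy')

# One made-up (context, target) training pair as one-hot vectors.
context = tf.one_hot([42], depth=vocab_size)   # shape (1, 3000)
target = tf.one_hot([7], depth=vocab_size)     # shape (1, 3000)

E_before = model.get_layer('embedding').get_weights()[0].copy()
model.train_on_batch(context, target)          # one forward/backward pass
E_after = model.get_layer('embedding').get_weights()[0]

# The one-hot input zeroes the gradient for every other row of E,
# so only the context word's row changes.
changed_rows = np.where(np.any(E_before != E_after, axis=1))[0]
print(changed_rows)                            # -> [42]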