When we learn word embedding models like word2vec or GloVe, our primary goal is to find word embeddings such that two similar words have similar embeddings. But in a skip-gram model (X → Y) we use the embedding vector (e) as the input, even though e is exactly what still needs to be derived. Could someone please explain how the embedding matrix E is learned and how the embedding vector is derived from it?
Let’s take a look at skip-gram with a simple network below.
import tensorflow as tf

vocab_size = 3000
embedding_size = 100

model = tf.keras.Sequential([
    tf.keras.Input(shape=(vocab_size,), name='context_word'),
    tf.keras.layers.Dense(embedding_size, use_bias=False, name='embedding'),
    tf.keras.layers.Dense(vocab_size, activation='softmax', name='target_word')
], name='skip-gram')
model.summary()
This is a simple version of the skip-gram model. Just as Andrew said in the lecture, the input is a one-hot vector of vocabulary size (3000 in this case), the output of the hidden layer (the embedding layer) is an embedding vector of embedding size (100 in this case), the parameter count of the embedding layer is the size of the embedding matrix E (3000 x 100 in this case), and the output layer produces the target word (its parameters are theta).
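To make the original question concrete: after training, E is just the weight kernel of the 'embedding' Dense layer, and because the input is one-hot, multiplying by E simply selects one row of it. Here is a small sketch of reading an embedding out of the toy model above (the word index is made up for illustration):

# The kernel of the 'embedding' layer is the embedding matrix E.
E = model.get_layer('embedding').get_weights()[0]   # shape (3000, 100)

# A one-hot input times E just picks out one row of E,
# so the embedding of word i is E[i].
word_index = 42                    # hypothetical vocabulary index
embedding_vector = E[word_index]   # shape (100,)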
Hopefully, it’s helpful.
Shouldn’t the size of the embedding matrix be 100 x 3000?
I had a similar question after watching Sequence Models > Week 2 > Learning Word Embeddings multiple times. It wasn’t immediately obvious to me that the embeddings ARE the parameters being trained (or at least some of them). Instead, from the way the process was described mechanically, it seemed more like the embedding matrix was an input, known a priori, and the actual parameters being trained were those of the neural net and softmax layers.
Pedagogically, I think it would have been helpful to see a quick “turn of the crank” showing what gets updated in each forward-prop/backward-prop cycle (without going deep into the derivatives, etc.), plus an explicit mention that once the model is trained, only the embedding matrix matters; the dense layer and output layer parameters are no longer needed.
(Of course, if I’ve gotten any of that wrong, I’m more than happy to be corrected!)
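For what it’s worth, here is a minimal sketch of that “turn of the crank” using the toy Keras model from the reply above. The compile settings and the single (context, target) pair are made up for illustration; the point is just that one forward/backward pass updates only the row of E selected by the one-hot context word:

import numpy as np
import tensorflow as tf

model.compile(optimizer='adam', loss='categorical_crossentropy')

# One made-up (context, target) training pair as one-hot vectors.
context = tf.one_hot([42], depth=vocab_size)   # shape (1, 3000)
target = tf.one_hot([7], depth=vocab_size)     # shape (1, 3000)

E_before = model.get_layer('embedding').get_weights()[0].copy()
model.train_on_batch(context, target)          # one forward/backward pass
E_after = model.get_layer('embedding').get_weights()[0]

# The one-hot input zeroes the gradient for every other row of E,
# so only the context word's row changes.
changed_rows = np.where(np.any(E_before != E_after, axis=1))[0]
print(changed_rows)                            # -> [42]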