Do we need matrix multiplication to get a word embedding?

To get the embedding vector, the lecture says that:

`E * one-hot-vector`

(300, 10000) * (10000, 1) = (300, 1)

So my question is: why do we need to do this matrix multiplication? Can't we just use the index of a word to retrieve the corresponding column vector from the matrix E? That has the same effect but doesn't need a matrix multiplication. And if E were stored in a dictionary, we could just use the key (the word's index in the vocabulary) to retrieve the embedding vector.
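To make the equivalence concrete, here is a small NumPy sketch (the sizes match the lecture's example; the word index is made up) showing that multiplying E by a one-hot vector gives exactly the same (300, 1) result as slicing out the corresponding column:

```python
import numpy as np

# Sizes from the lecture's example: 300-dim embeddings, 10000-word vocabulary.
embedding_dim, vocab_size = 300, 10000
rng = np.random.default_rng(0)
E = rng.standard_normal((embedding_dim, vocab_size))

word_index = 4231                       # hypothetical index of a word in the vocabulary
one_hot = np.zeros((vocab_size, 1))
one_hot[word_index] = 1.0

via_matmul = E @ one_hot                # (300, 1): full matrix multiplication
via_lookup = E[:, [word_index]]         # (300, 1): direct column lookup, no matmul

print(np.allclose(via_matmul, via_lookup))  # True: the two results are identical
```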

Hello @Martinmin, I can see the rationale. To further support that idea, some speed tests would be helpful. For example, given a (300, 10000) embedding matrix and 1000 indices, both approach A (matrix multiplication) and approach B (dictionary/index lookup) produce a (300, 1000) matrix of chosen embeddings. How long do the two approaches take?
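One way to run such a test (a rough sketch; exact timings will vary by machine, and approach B here uses NumPy fancy indexing as the stand-in for a dictionary lookup) could look like this:

```python
import time
import numpy as np

# Benchmark setup matching the numbers above: a (300, 10000) embedding
# matrix and 1000 word indices to look up.
rng = np.random.default_rng(0)
E = rng.standard_normal((300, 10000))
indices = rng.integers(0, 10000, size=1000)

# Approach A: build a (10000, 1000) one-hot matrix and multiply.
one_hot = np.zeros((10000, 1000))
one_hot[indices, np.arange(1000)] = 1.0
t0 = time.perf_counter()
a = E @ one_hot
t_matmul = time.perf_counter() - t0

# Approach B: fancy indexing pulls the chosen columns directly.
t0 = time.perf_counter()
b = E[:, indices]
t_lookup = time.perf_counter() - t0

print(f"matmul: {t_matmul:.6f}s  lookup: {t_lookup:.6f}s")
print(np.allclose(a, b))  # both give the same (300, 1000) result
```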


For batch mode, matrix multiplication is probably more efficient. Another reason for using one-hot vectors may be so that they can serve as an input layer to the network during training. So a related question about the embedding matrix E (in `E * one-hot`) shown on several slides: this E is to be learned by the network. It doesn't exist in the first place; it is the parameter matrix of the hidden layer. Is that right?

Like the Dense layer, the Embedding layer is itself a layer, which accepts an index as input and returns the embedding as output. I prefer to use the more accurate names Dense layer and Embedding layer rather than hidden layer.

Just like a Dense layer is randomly initialized to weights of a certain shape, the Embedding layer is also randomly initialized to weights of a certain shape, which, as exemplified by that slide, is (300, 10000), depending on the vocabulary size and the embedding dimension.

Lastly, just like we can update a Dense layer's weights by gradient descent, we can also update the Embedding layer's weights by gradient descent.
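To illustrate all three points together, here is a minimal sketch in plain NumPy (not Keras; the tiny sizes, the target vector, and the squared-error loss are all made up for illustration) of an embedding "layer": randomly initialized weights, an index-lookup forward pass, and a gradient-descent update that touches only the looked-up column:

```python
import numpy as np

# Random initialization, just like a Dense layer's weights.
rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4                 # tiny sizes for illustration
E = rng.standard_normal((embedding_dim, vocab_size)) * 0.01

word_index = 3                                    # hypothetical input index
target = np.ones(embedding_dim)                   # hypothetical training target

lr = 0.1
for _ in range(100):
    emb = E[:, word_index]                        # forward pass: a lookup, no matmul
    grad = 2 * (emb - target)                     # gradient of squared-error loss
    E[:, word_index] -= lr * grad                 # only that word's column is updated

print(np.round(E[:, word_index], 3))              # the embedding has moved to the target
```

Note the design point this makes: because the one-hot input selects a single column, the gradient for everything except that column is zero, so frameworks implement the Embedding layer as a lookup plus a sparse update rather than an actual matrix multiplication.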