I have a question on the lecture on sentiment classification in week 2 of the sequence models course. In particular on slide 2:
In the beginning Andrew says that for the sentence “The dessert is excellent” we assume a vocabulary of 10000 words (as in all the examples), so a one-hot representation of one word would be a vector of dimension (10000, 1).
Then, when introducing the embedding matrix E with 300 embedding features, he mentions that it could be trained on a much larger dataset, e.g. one made up of 1 billion words.
From what I understand, E would then be a matrix of dimension (300, 1000000000), i.e. one embedding for every word in the dataset.
So the matrix multiplication does not work. In other words: how do I find one particular word of a smaller vocabulary in the embedding matrix of a much larger corpus?
I am sure I am getting something wrong. I would be happy if someone could go into a bit more detail on this.
Thank you so much for your help!
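For concreteness, the lookup described on the slide can be sketched in numpy. The sizes (300 features, 10000-word vocabulary) are from the lecture; the random matrix and the word index are illustrative placeholders:

```python
import numpy as np

# Sizes from the lecture: 300 embedding features, 10000-word vocabulary.
embedding_dim, vocab_size = 300, 10_000

rng = np.random.default_rng(0)
E = rng.standard_normal((embedding_dim, vocab_size))  # embedding matrix, shape (300, 10000)

# One-hot vector for a word at an arbitrary index (8252 is just an example).
o = np.zeros((vocab_size, 1))
o[8252] = 1.0

# E @ o simply selects column 8252 of E -- that word's embedding.
e = E @ o  # shape (300, 1)
assert np.allclose(e[:, 0], E[:, 8252])
```

Note that in practice nobody performs this multiplication; frameworks just index the column directly (`E[:, 8252]`), which is why the mapping from word to index matters so much.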
Please provide a link to the lecture and the timestamp.
There are two techniques people use when dealing with pretrained word embeddings:
- Initialize the embedding layer with the pretrained embedding weights, and map the words in your training/test data to the word indices of the pretrained embedding so that the lookup works properly.
- Create a new embedding matrix and initialize its weights with the pretrained weights, but only for the words in your training corpus. The advantage of this approach over the previous one is that the embedding matrix can be much smaller, which matters if space is a concern.
For words in your corpus that are outside the pretrained embedding vocabulary, evaluate the following approaches:
- Map these words to an OOV (out-of-vocabulary) token.
- Initialize to random weights.
- Initialize to zeros.
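The three OOV options above can be sketched like this. Again a hedged sketch: the `pretrained` dict, the `embed` helper, and the 0.01 random scale are all illustrative choices, not something prescribed by the lecture:

```python
import numpy as np

embedding_dim = 300
rng = np.random.default_rng(42)

# Toy pretrained store: only this word has a vector.
pretrained = {"dessert": rng.standard_normal(embedding_dim)}

# Option 1: a single shared <OOV> vector for every unknown word
# (the 0.01 scale is an illustrative choice).
oov_vec = rng.standard_normal(embedding_dim) * 0.01

def embed(word, oov_strategy="oov_token"):
    if word in pretrained:
        return pretrained[word]
    if oov_strategy == "oov_token":  # option 1: shared <OOV> token
        return oov_vec
    if oov_strategy == "random":     # option 2: fresh random weights per word
        return rng.standard_normal(embedding_dim) * 0.01
    return np.zeros(embedding_dim)   # option 3: zeros
```

With the shared-token strategy, every unknown word maps to the same vector, so `embed("foo")` and `embed("bar")` are identical; with random initialization each unknown word gets its own (trainable) starting point.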
When he says that the embedding model was trained on a billion words, that probably refers to the size of the training corpus, not the size of the vocabulary. Are there a billion unique words in English? I doubt it. But even if he is talking about the vocabulary, you will have to subset the embedding to include only the words in your vocabulary, as Balaji describes.
Hi Balaji, Hi Paul,
thank you for your help! I think I understand it now.
Perhaps you could consider adding a note to the lecture, since it does not become clear from the slides alone that an intermediate step is required to make the pretrained embeddings compatible with an individual problem.
Thank you again for your efforts!
You’re welcome, Max. The staff have been informed about your suggestion.