[Week 2] - Embedding and Transfer Learning


I had a few questions about word embedding and transfer learning:

1: We’ve been told that the embedding matrix is learned and that its axes are hard to interpret. So how is the embedding vector for a previously unseen word (e.g. durian) constructed? Does a human sit and assign values based on similar words, e.g. durian embedding vector = orange embedding vector?

2: The professor mentions that a text corpus of 100B words could be used to train the embedding matrix. What is the size of the embedding matrix? Is it 300 × (number of unique words in the corpus)? Do we take a subset of this corpus and its embedding to train the RNN, or do we take just the embedding and hope that its words show up in the training dataset?

Thank you.

Hey @AfoDubhashi, great questions!

As a side note, a word embedding is a dense vector representation of a word. We can also choose to embed characters, subwords, or even whole sentences instead.

To create word embeddings:

  • We first tokenize the text and build a vocabulary. We also add a special token [OOV] (out of vocabulary) to represent unknown words that may appear at inference time.
  • There are many ways to create embeddings from tokens. The modern approach is to pass the tokens through a neural network and then use the weights of an inner layer of the network as dense vector representations of the words.
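To make the two steps above concrete, here is a minimal sketch in plain Python. The names `build_vocab`, `init_embeddings`, and `embed` are made up for illustration, the tiny corpus is invented, and the matrix is filled with random numbers rather than learned weights, so treat it as a picture of the data structure only.

```python
import random

OOV = "[OOV]"

def build_vocab(corpus_tokens):
    """Map each distinct token to an integer id; id 0 is reserved for [OOV]."""
    vocab = {OOV: 0}
    for token in corpus_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def init_embeddings(vocab_size, dim, seed=0):
    """One row per token. Values are random here; training would learn them."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(word, vocab, matrix):
    """Return the word's row, falling back to the shared [OOV] row."""
    return matrix[vocab.get(word, vocab[OOV])]

# Hypothetical toy corpus: 6 distinct words plus the [OOV] token.
corpus = "i like orange juice and apple juice".split()
vocab = build_vocab(corpus)
matrix = init_embeddings(len(vocab), dim=4)

# Shape is (vocabulary size) x (embedding dimension): 7 rows, 4 columns here.
print(len(matrix), len(matrix[0]))
# "durian" was never seen, so it gets exactly the same row as [OOV].
print(embed("durian", vocab, matrix) is embed(OOV, vocab, matrix))
```

In this picture every unseen word collapses onto the same row, which is why a pure word-level vocabulary cannot tell durian apart from any other unknown word.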

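The dimensionality of the embedding is a hyperparameter you choose. Larger networks usually use larger embedding dimensions, because they have the capacity to learn more. The embedding matrix has one vector per token in the vocabulary, so with 300 dimensions its size is 300 × (vocabulary size).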

We usually learn embeddings with minibatch stochastic methods, such as minibatch SGD. That means we don’t need to hold the entire corpus in memory; one minibatch at a time is enough.
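A rough sketch of what "only a single minibatch in memory" can look like: the generator below reads tokens lazily and yields one batch at a time. The helper names `token_stream` and `minibatches` and the sample lines are invented for this example, and the actual training step is left out.

```python
from itertools import islice

def token_stream(lines):
    """Yield tokens one by one; `lines` could be a file object read lazily."""
    for line in lines:
        yield from line.split()

def minibatches(tokens, batch_size):
    """Yield lists of up to batch_size tokens without materializing the stream."""
    iterator = iter(tokens)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical stand-in for a huge corpus file.
lines = ["i like orange juice", "durian smells strong", "apple pie"]

# Collected into a list only for display; a training loop would consume
# each batch, update the embeddings, and then discard it.
batches = list(minibatches(token_stream(lines), batch_size=4))
for batch in batches:
    print(batch)
```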

Say we have a separate token for OOV or UNK. It can stand in for any word in a sentence, like a joker in a pack of cards. How can we know that durian is semantically similar to orange if the former only has the OOV embedding? I guess my question is: how do we construct embeddings for words that do not occur in the vocabulary, other than using an OOV token, which obviously carries no meaning?

With word-level embeddings, [OOV] is the only option. That’s one of the reasons why we tokenize into subwords in practice. If you tokenize into characters, you don’t have this problem at all, because the set of characters is small and fixed, so every word can be built from it.
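As one possible illustration of the subword idea, the sketch below splits a word into character trigrams and averages the vectors of those trigrams, loosely in the style of fastText. This particular scheme is my assumption, not necessarily the tokenizer meant above. The function names, the table size of 1000 buckets, and the random values are all made up, so the resulting vectors have no meaning until the table is trained.

```python
import random
import zlib

def char_ngrams(word, n=3):
    """Character n-grams of a non-empty word, with < and > marking its boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def bucket(ngram, num_buckets):
    """Deterministically hash an n-gram to a row index of the table."""
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets

def subword_vector(word, table, n=3):
    """Average the table rows of the word's n-grams (assumes len(word) >= 1)."""
    ids = [bucket(g, len(table)) for g in char_ngrams(word, n)]
    dim = len(table[0])
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]

# Hypothetical table: 1000 buckets, 4 dimensions, random instead of learned.
rng = random.Random(0)
table = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(1000)]

print(char_ngrams("durian"))  # ['<du', 'dur', 'uri', 'ria', 'ian', 'an>']
print(subword_vector("durian", table))
```

Because any string can be cut into n-grams, "durian" gets its own vector even if it never appeared as a whole word, so no [OOV] fallback is needed. How close that vector ends up to "orange" still depends on what the n-gram vectors learned during training.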

Noted. Thank you. Is there an article or other literature I should read to better understand this? I guess I don’t have a firm grasp of this topic.

In my opinion, Speech and Language Processing (Jurafsky and Martin) is the best NLP book at the moment.
