Is there a reason why we’re using the weights W1 and W2 or their mean to get the word embeddings? Intuitively, I was expecting the (Nx1) values in the hidden layer to be the obvious choice. Any particular reason for choosing the weights to represent words or is it because it just works?
Good question.
Couple of points to get the intuition right:
-
First, embeddings are numerical representation of words.
Why numerical? - because computers work with math and math needs numbers. -
Second, people came up with a myriad of ways to represent words with math (or as numbers). One of them is CBOW - the approach of predicting the word by the surrounding words (and where the order of the surrounding words does not matter). Which is a simple approach, but for some tasks it might be enough.
The logic behind this approach lies in distributional hypothesis which states that words in similar contexts have similar meanings. In particular, we try to increase the quantity v_w · v_c for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other.
One example to illustrate for further intuition - we could use Pointwise mutual information which is very easy to understand with the table:
The problem with this approach is that for every word we would get a very long embedding vector - the vocabulary long embedding vector (for example 30 000 long) just for 1 word context. If we would consider multiple word context, the number of rows (or columns, depending on the implementation) in this matrix would rise exponentially (maybe not to 30 000 x 30 000 but exponentially). In other words, impossible to work with (computation wise and also not enough text in the world to populate the matrix to at least a decent accuracy).
Now, regarding your question, one trade-off that can be made is that we actually don’t need that many dimensions (or features) to represent a word. This number (among other things) is left for you to decide. The other thing (among other things) is the architecture of the Neural Network (if you decided to use NN to get these representations).
So, if you decided to go with a single layer Neural Network, then your question would be irrelevant (since there would be no W2). But, if you decided to go with a two layer NN (like in the Assignment) then you have options (the three you mentioned and also other combinations). Think what happens when you have more than two layers and also of different types (like RNN, Attention, etc)…
Reality is that none of them (and all of them at the same time) is a correct way to represent words. If the weights are initialized randomly (as is often the case) you never arrive at the same values even with the same dataset which suggests that there is no one correct way to represent the same words but at the same time, these weights are good enough to represent these words for your objective. Later, when you want to interpret them, you have many options but you have to keep in mind that these weights are dependent on each other, on the dataset, on the NN architecture (number and types of layers), on the loss function, on the optimizer (Stochastic Gradient Descent or Adam), on many hyper-parameters (number of dimensions, learning rate, etc.etc.), and among other things - on the initial random state.
So, to make the story short - the best intuition is to understand the underlying processes that led to these values (because there is no “obvious choice”).
Cheers
Thank you so much for your comprehensive and in-depth reply!
So far, I’ve only seen hidden layer representations being used as encodings/embeddings, so I was just taken by surprise when the weights were used instead.
I greatly appreciate your amazing explanation. Cheers!