Embedding layer, why is it needed?

Hi everyone,

This is not particularly tied to any assignment, but more like a question to improve my understanding.

I understand what embeddings are for: they are the internal representation of words in a smaller dimension for the model. The question is: for LSTMs and RNNs, why do you need an embedding layer in front? I also saw that it is used as the first layer for sentiment classification.

What is the intuition behind placing it there? And how will the model use the layer (intuitively speaking) for predictions? I’m puzzled because there is a weight matrix and other trainable weights involved in the other layers, so I thought the information there would be more important.

Thanks in advance

The embedding layer, as far as I understand, is just a vector-space representation of words, arranged so that it is easier to process than raw words (or some other encoding). There are also other qualities to it, as you say.

These representations are fed to the RNN, which is then trained to find weights and biases so that it becomes able to predict. The embeddings themselves don’t have much predictive power on their own.
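To make that concrete, here is a minimal sketch of the kind of model being described, assuming a Keras-style setup for a sentiment classifier (the vocabulary size and dimensions are made-up illustration values, not anything from the course):

```python
import tensorflow as tf

# Hypothetical sizes, just for illustration.
vocab_size = 10000   # number of distinct token ids the tokenizer can produce
embed_dim = 5        # length of each word vector

model = tf.keras.Sequential([
    # Maps each integer token id to a trainable vector of length embed_dim.
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    # The RNN/LSTM reads the sequence of vectors and summarizes it in its hidden state.
    tf.keras.layers.LSTM(32),
    # The final layer turns that summary into, e.g., a sentiment probability.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```

The embedding weights are trained together with the LSTM and Dense weights, so the word vectors end up shaped by whatever minimizes the task loss.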

Is there an intuitive way to explain why it is “easier to process” with the embedding layer?

Let’s say we had a regular input, e.g. [248, 1028, 1, 44, 0, 0] for a padded input sentence. What abstraction power will the embedding layer provide to the network once this is fed in?

The advantages are that the embeddings are already processed and optimized to be used further down the model. Even if you use padded integer inputs, you still have to build vector representations somewhere in your model.
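One intuitive way to see why the dense vectors are “easier to process” than raw token ids (or one-hot vectors) is to compare the two directly. A toy NumPy sketch with made-up sizes:

```python
import numpy as np

vocab_size, embed_dim = 10000, 5   # made-up sizes

# One-hot: each word becomes a vector as long as the whole vocabulary,
# with a single 1 and zeros everywhere else.
one_hot = np.zeros(vocab_size)
one_hot[1028] = 1.0                    # "refrigerator" -> 10,000 numbers, almost all zero

# Embedding: each word is a short, dense vector looked up from a trainable table.
embedding_table = np.random.randn(vocab_size, embed_dim)
dense = embedding_table[1028]          # "refrigerator" -> just 5 numbers, all informative
```

The dense version is tiny by comparison, and because the table is trainable, those 5 numbers can be nudged during training so that words used in similar ways end up with similar vectors.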

Hi @Jose_Leal_Domingues

Let me offer you one analogy/intuition to illustrate what the embedding layer is and why it is useful.

As you correctly understand, the embedding layer is an “internal representation of words in a smaller dimension for the model”. Here the “representation” is a vector (a list of numbers), and “smaller” means smaller than the size of the whole dictionary or any other very large number.

So if you want to “represent” a word in a way a computer can work with, you need to convert it to a vector of numbers. For example, the word “refrigerator” could be assigned the number 1028 by a tokenizer; that number, passed through an embedding layer, could then produce a vector like [3.14, 2.5, -0.2, 0.1, 1.2]. Here the embedding dimension is 5, which means that every word you could use is reduced to 5 dimensions. For simplicity’s sake, we could “imagine/pretend” that the first dimension is “whiteness vs. blackness” (+10 meaning very white, -10 meaning very black), the second dimension could be “heaviness vs. lightness” (+10 very heavy, -10 very light), the third could be “good vs. evil”, and so on. The bigger the embedding dimension, the more “x vs. y” qualities you could encode.
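In code, that lookup could look roughly like this (a Keras-style sketch with an assumed vocabulary size; the actual numbers in the vector depend entirely on training):

```python
import tensorflow as tf

# Assumed vocabulary of 10,000 tokens, embedding dimension of 5.
embedding = tf.keras.layers.Embedding(input_dim=10000, output_dim=5)

token_id = tf.constant([1028])   # "refrigerator", as produced by the tokenizer
vector = embedding(token_id)     # shape (1, 5), e.g. something like [3.14, 2.5, -0.2, 0.1, 1.2]
```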

So if you pass the sentence “White refrigerator is evil.”, the tokenizer would convert it to numbers [320, 1028, 5, 128, 1], or for example [320, 1028, 5, 128, 1, 0, 0, 0] with padding (shape (8,)).

When you pass this to an embedding layer, what it could return would be:

[[10, 0.01, 2.3, 0.1, 0.1],    # White
[3.14, 2.5, -0.2, 0.1, 1.2],   # Refrigerator
[0.02, 0.3, 1.3, -0.5, 5],     # is
[-1.7, 1.7, -9.2, 0.1, 0.1],   # evil
[-0.1, 0.1, -0.2, 0.1, 0.1],   # .
[0.01, 0.01, 0.1, 0.1, 0.1],   # [pad]
[0.01, 0.01, 0.1, 0.1, 0.1],   # [pad]
[0.01, 0.01, 0.1, 0.1, 0.1]]   # [pad]

Shape - (8, 5)
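The same lookup applied to the whole padded sentence simply adds the embedding dimension to the shape. A rough sketch with the same assumed layer as above:

```python
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=10000, output_dim=5)

# "White refrigerator is evil ." plus three [pad] tokens, shape (8,).
sentence = tf.constant([320, 1028, 5, 128, 1, 0, 0, 0])

vectors = embedding(sentence)
print(vectors.shape)   # (8, 5): one 5-dimensional vector per token, padding included
```

In practice one would often pass mask_zero=True to the Embedding layer so that downstream RNN layers can ignore the padding positions.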

So the embedding layer is placed in front (before passing tensors to the RNN, MLP or whatever) because it maps integers to their representations in the embedding space. For example, here is how an RNN could work with the sequence (see the small code sketch after these steps):

Step 1:
hidden state: [0, 0, 0, 0, 0]
input: [10, 0.01, 2.3, 0.1, 0.1]  # White
new hidden state: [2, 0, 0.4, 0.1, -2]

Step 2:
hidden state: [2, 0, 0.4, 0.1, -2]
input: [3.14, 2.5, -0.2, 0.1, 1.2]  # Refrigerator
new hidden state: [7, 1, -0.3, -0.1, -0.3]
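To make those updates concrete, here is a toy NumPy version of a plain RNN step, h_t = tanh(W_x·x_t + W_h·h_(t-1) + b), with random stand-in weights in place of what training would learn:

```python
import numpy as np

embed_dim = hidden_dim = 5
rng = np.random.default_rng(0)

# Toy weights; in a real model these are learned during training.
W_x = rng.normal(size=(hidden_dim, embed_dim)) * 0.1   # input-to-hidden
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden
b = np.zeros(hidden_dim)

def rnn_step(h_prev, x_t):
    """One recurrence step: mix the previous hidden state with the new word vector."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_dim)                        # initial hidden state
white = np.array([10, 0.01, 2.3, 0.1, 0.1])     # embedding of "White" (made-up numbers)
fridge = np.array([3.14, 2.5, -0.2, 0.1, 1.2])  # embedding of "Refrigerator"

h = rnn_step(h, white)    # hidden state now summarizes "White"
h = rnn_step(h, fridge)   # hidden state now summarizes "White Refrigerator"
```

The exact numbers in the hidden states above are of course made up; the point is only that each step folds the new word vector into the running summary.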


Of course, the “whiteness vs blackness”, “heavyness vs lightness”, “good vs evil” is just for illustration purposes, in reality the embedding layer “tries it’s best” to assign values (for each embedding dimension, for each word), that minimize some loss.