Embedding layer, why is it needed?

Hi @Jose_Leal_Domingues

Let me offer you one analogy/intuition to illustrate what the embedding layer is and why it is useful.

As you correctly understand, the embedding layer is an “internal representation of words in a smaller dimension for the model”. Here the “representation” is a vector (a list of numbers), and “smaller” means that the vector is much shorter than the size of the whole dictionary (the vocabulary).

So if you want to “represent” a word in a form that a computer can work with, you need to convert it to a vector of numbers. For example, the word “refrigerator” could be assigned the number 1028 by a tokenizer. When that number is passed to an embedding layer, the layer could produce a vector of numbers like [3.14, 2.5, -0.2, 0.1, 1.2]. Here, the embedding dimension is 5, which means that every word you could use is described by just 5 numbers. For simplicity's sake, we could “imagine/pretend” that the first dimension is “whiteness vs blackness” (+10 meaning very white, -10 meaning very black), the second dimension could be “heaviness vs lightness” (+10 very heavy, -10 very light), the third could be “good vs evil”, and so on. The bigger the embedding dimension, the more such “x vs y” axes you have for assigning values.

So if you pass the sentence “White refrigerator is evil.”, the tokenizer would convert it to the numbers [320, 1028, 5, 128, 1] or, with padding, to for example [320, 1028, 5, 128, 1, 0, 0, 0] (shape (8,)).
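To make the tokenizer step concrete, here is a minimal sketch in plain Python. The vocabulary, the `tokenize` helper and the choice of 0 as the padding id are assumptions made up for this illustration (they just mirror the numbers above); a real tokenizer is more involved.

```python
# Hypothetical toy vocabulary; the ids mirror the example in this post.
VOCAB = {"white": 320, "refrigerator": 1028, "is": 5, "evil": 128, ".": 1}
PAD_ID = 0  # assumed padding id


def tokenize(sentence, max_len):
    """Map a sentence to a list of token ids, right-padded to max_len."""
    # Lowercase, split the period off as its own token, then split on spaces.
    words = sentence.lower().replace(".", " .").split()
    ids = [VOCAB[word] for word in words]
    # Append PAD_ID until the list has max_len entries.
    return ids + [PAD_ID] * (max_len - len(ids))


token_ids = tokenize("White refrigerator is evil.", max_len=8)
print(token_ids)
```

Every sentence in a batch is padded to the same length so that the batch can be stored as one rectangular tensor.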

When you pass this to an embedding layer, what it could return would be:

[[10, 0.01, 2.3, 0.1, 0.1], # White
[3.14, 2.5, -0.2, 0.1, 1.2], # Refrigerator
[0.02, 0.3, 1.3, -0.5, 5], # is
[-1.7, 1.7, -9.2, 0.1, 0.1], # evil
[-0.1, 0.1, -0.2, 0.1, 0.1], # .
[0.01, 0.01, 0.1, 0.1, 0.1], # [pad]
[0.01, 0.01, 0.1, 0.1, 0.1], # [pad]
[0.01, 0.01, 0.1, 0.1, 0.1]] # [pad]

Shape - (8, 5)
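Under the hood, this lookup is just table indexing: each token id selects one row of a table of vectors. A minimal sketch in plain Python, assuming a hypothetical `EMBEDDING_TABLE` that holds only the six ids used here (a real layer stores one row per vocabulary entry, as a matrix of shape (vocab_size, embedding_dim)):

```python
# Hypothetical lookup table: token id -> embedding vector (dimension 5).
# The values are the made-up ones from this post, not learned weights.
EMBEDDING_TABLE = {
    320: [10, 0.01, 2.3, 0.1, 0.1],     # white
    1028: [3.14, 2.5, -0.2, 0.1, 1.2],  # refrigerator
    5: [0.02, 0.3, 1.3, -0.5, 5],       # is
    128: [-1.7, 1.7, -9.2, 0.1, 0.1],   # evil
    1: [-0.1, 0.1, -0.2, 0.1, 0.1],     # .
    0: [0.01, 0.01, 0.1, 0.1, 0.1],     # [pad]
}


def embed(token_ids):
    """Replace every token id with its row from the table."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


embedded = embed([320, 1028, 5, 128, 1, 0, 0, 0])
shape = (len(embedded), len(embedded[0]))
print(shape)  # (8, 5)
```

Note that the same id always gives the same vector, which is why the three [pad] rows are identical.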

So the embedding layer is placed in front (before passing tensors to an RNN, an MLP or whatever) because it maps integers to their representations in the embedding space. For example, here is how an RNN could work through the sequence:

In:
hidden state: [0, 0, 0, 0, 0],
input : [10, 0.01, 2.3, 0.1, 0.1], # White

Out:
hidden state: [2, 0, 0.4, 0.1, -2]

Next step:
In:
hidden state: [2, 0, 0.4, 0.1, -2]
input : [3.14, 2.5, -0.2, 0.1, 1.2], # Refrigerator

Out:
hidden state: [7, 1, -0.3, -0.1, -0.3]

etc.
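The same loop can be sketched in plain Python. This assumes a vanilla RNN update, new_hidden = tanh(W_x · input + W_h · hidden + b), with small random weights that are invented for the illustration. Because tanh squashes values into the range -1 to 1, the hidden states it prints will not match the made-up numbers above.

```python
import math
import random

DIM = 5  # embedding size and hidden size (both 5 in this post's example)

# Hypothetical, randomly initialised weights; a trained RNN would learn these.
rng = random.Random(0)
W_x = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
b = [0.0] * DIM


def rnn_step(hidden, x):
    """One vanilla RNN step: tanh(W_x @ x + W_h @ hidden + b)."""
    new_hidden = []
    for i in range(DIM):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(DIM))
        total += sum(W_h[i][j] * hidden[j] for j in range(DIM))
        new_hidden.append(math.tanh(total))
    return new_hidden


sequence = [
    [10, 0.01, 2.3, 0.1, 0.1],    # White
    [3.14, 2.5, -0.2, 0.1, 1.2],  # Refrigerator
]

hidden = [0.0] * DIM  # initial hidden state
for x in sequence:
    # The output of one step becomes the hidden-state input of the next.
    hidden = rnn_step(hidden, x)
    print(hidden)
```

The important part is that the RNN only ever sees the embedding vectors, never the raw token ids.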

Of course, the “whiteness vs blackness”, “heaviness vs lightness” and “good vs evil” dimensions are just for illustration purposes. In reality the embedding layer “tries its best” to assign values (for each embedding dimension, for each word) that minimize some loss.
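As a rough sketch of what “minimize some loss” means, here is gradient descent on a single word's embedding in plain Python. Everything in it is a toy assumption: the fixed weights `w` stand in for the rest of the model, and the squared-error loss and the target value are made up. In a real network the loss comes from the task, and the embeddings are updated together with all the other weights.

```python
# Hypothetical toy setup: one word's embedding, a fixed linear "model" w,
# and a target output the model should produce for that word.
embedding = [0.1, -0.2, 0.05, 0.0, 0.3]
w = [0.5, -1.0, 0.25, 0.75, -0.5]
target = 1.0
learning_rate = 0.1


def predict(e):
    """Model output: dot product of the fixed weights and the embedding."""
    return sum(w_i * e_i for w_i, e_i in zip(w, e))


def loss(e):
    """Squared error between the prediction and the target."""
    return (predict(e) - target) ** 2


initial_loss = loss(embedding)

for _ in range(50):
    error = predict(embedding) - target
    # d(loss)/d(e_i) = 2 * error * w_i, so step against the gradient.
    embedding = [
        e_i - learning_rate * 2 * error * w_i
        for e_i, w_i in zip(embedding, w)
    ]

final_loss = loss(embedding)
print(initial_loss, final_loss)
print(embedding)
```

The final values in `embedding` have no human-readable meaning like “whiteness”; they are simply whatever numbers made the loss small.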
