Embedding layer, why is it needed?

arvyzukai · September 12, 2022, 10:01am

Let me offer you one analogy/intuition to illustrate what the embedding layer is and why it is useful.

As you correctly understand, the embedding layer is an “internal representation of words in a smaller dimension for the model”. Here the “representation” is a vector (a list of numbers), and “smaller” means that the vector's length is much smaller than the size of the whole vocabulary (the dictionary of all words the model knows).

So if you want to represent a word in a form that a computer can work with, you need to convert it to a vector of numbers. For example, the word “refrigerator” could be assigned the number 1028 by a tokenizer, and that number, when passed to an embedding layer, could produce a vector like [3.14, 2.5, -0.2, 0.1, 1.2]. Here the embedding dimension is 5, which means that every word you could use is described by just 5 numbers. For simplicity's sake, we could imagine/pretend that the first dimension is “whiteness vs blackness” (+10 meaning very white, -10 meaning very black), the second is “heaviness vs lightness” (+10 very heavy, -10 very light), the third is “good vs evil”, and so on. The bigger the embedding dimension, the more such “x vs y” axes there are to assign values on.

So if you pass the sentence “White refrigerator is evil.”, the tokenizer would convert it to the numbers [320, 1028, 5, 128, 1] or, with padding to length 8, [320, 1028, 5, 128, 1, 0, 0, 0] (shape (8,)).
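To make the tokenizer step concrete, here is a minimal sketch in plain Python. The `VOCAB` dictionary and the `tokenize` helper are made up for this illustration (only the ids come from the example above); a real tokenizer has a much larger vocabulary and handles unknown words, which this one does not.

```python
# Toy vocabulary: the ids follow the example above and are otherwise arbitrary.
VOCAB = {"[pad]": 0, ".": 1, "is": 5, "evil": 128, "white": 320, "refrigerator": 1028}

def tokenize(sentence, max_len=8):
    """Map a sentence to a fixed-length list of integer ids (hypothetical helper)."""
    # Lowercase, and split the period off as its own token.
    words = sentence.lower().replace(".", " .").split()
    ids = [VOCAB[w] for w in words]
    # Right-pad with the [pad] id up to max_len (no truncation in this sketch).
    return ids + [VOCAB["[pad]"]] * (max_len - len(ids))

token_ids = tokenize("White refrigerator is evil.")
print(token_ids)       # [320, 1028, 5, 128, 1, 0, 0, 0]
print(len(token_ids))  # 8
```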

When you pass this padded sequence to an embedding layer, it could return:

[[10, 0.01, 2.3, 0.1, 0.1], # White
[3.14, 2.5, -0.2, 0.1, 1.2], # Refrigerator
[0.02, 0.3, 1.3, -0.5, 5], # is
[-1.7, 1.7, -9.2, 0.1, 0.1], # evil
[-0.1, 0.1, -0.2, 0.1, 0.1], # .
[0.01, 0.01, 0.1, 0.1, 0.1], # [pad]
[0.01, 0.01, 0.1, 0.1, 0.1], # [pad]
[0.01, 0.01, 0.1, 0.1, 0.1]] # [pad]

Shape - (8, 5)
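Under the hood this is just a table lookup: one row per token id. The sketch below shows the idea in plain Python; `VOCAB_SIZE`, `table` and `embed` are names invented for the illustration, and the zero rows are a simplification (a real layer would initialize every row randomly and then learn it).

```python
VOCAB_SIZE = 1029   # assumed: just large enough to hold id 1028
EMBED_DIM = 5

# The embedding table has one row per token id: shape (VOCAB_SIZE, EMBED_DIM).
# Rows that the example does not use are left at zero here.
table = [[0.0] * EMBED_DIM for _ in range(VOCAB_SIZE)]
table[320] = [10, 0.01, 2.3, 0.1, 0.1]      # white
table[1028] = [3.14, 2.5, -0.2, 0.1, 1.2]   # refrigerator
table[5] = [0.02, 0.3, 1.3, -0.5, 5]        # is
table[128] = [-1.7, 1.7, -9.2, 0.1, 0.1]    # evil
table[1] = [-0.1, 0.1, -0.2, 0.1, 0.1]      # .
table[0] = [0.01, 0.01, 0.1, 0.1, 0.1]      # [pad]

def embed(token_ids):
    """Return one row of the table per token id."""
    return [table[i] for i in token_ids]

token_ids = [320, 1028, 5, 128, 1, 0, 0, 0]
embedded = embed(token_ids)
print(len(embedded), len(embedded[0]))  # 8 5
print(embedded[1])                      # [3.14, 2.5, -0.2, 0.1, 1.2]
```

Note that every [pad] position gets the same row, because all of them share id 0.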

So the embedding layer is placed in front (before the tensors are passed to an RNN, MLP or whatever) because it maps integers to their representations in the embedding space. For example, here is how an RNN could work through the sequence:

In:
hidden state: [0, 0, 0, 0, 0]
input: [10, 0.01, 2.3, 0.1, 0.1] # White

Out:
hidden state: [2, 0, 0.4, 0.1, -2]

Next step:
In:
hidden state: [2, 0, 0.4, 0.1, -2]
input: [3.14, 2.5, -0.2, 0.1, 1.2] # Refrigerator

Out:
hidden state: [7, 1, -0.3, -0.1, -0.3]

etc.
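The step-by-step update above could be sketched like this. I am assuming a simple (Elman-style) RNN cell, h' = tanh(W_h h + W_x x + b), with random untrained weights; `rnn_step` and the weight names are made up for the illustration. The printed numbers will not match the hand-written hidden states above, since those were invented and tanh keeps every value between -1 and 1.

```python
import math
import random

def rnn_step(hidden, x, w_h, w_x, bias):
    """One step of a simple RNN cell: h' = tanh(W_h h + W_x x + b)."""
    new_hidden = []
    for row_h, row_x, b in zip(w_h, w_x, bias):
        total = b
        total += sum(w * h for w, h in zip(row_h, hidden))  # contribution of old state
        total += sum(w * v for w, v in zip(row_x, x))       # contribution of the input
        new_hidden.append(math.tanh(total))
    return new_hidden

random.seed(0)
DIM = 5  # hidden size equals the embedding dimension here, as in the example
w_h = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
w_x = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
bias = [0.0] * DIM

sequence = [
    [10, 0.01, 2.3, 0.1, 0.1],     # White
    [3.14, 2.5, -0.2, 0.1, 1.2],   # Refrigerator
]

hidden = [0.0] * DIM               # initial hidden state
for x in sequence:
    hidden = rnn_step(hidden, x, w_h, w_x, bias)
    print(hidden)                  # the output of one step is the input state of the next
```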

Of course, the “whiteness vs blackness”, “heaviness vs lightness” and “good vs evil” labels are just for illustration. In reality, the embedding layer “tries its best” to assign values (for each embedding dimension, for each word) that minimize some loss, so the learned dimensions rarely have such clean human-readable meanings.
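To give a feel for what “minimize some loss” means for a single row of the table, here is a toy sketch. The loss is a stand-in I made up (squared distance to a fixed `target` vector); in a real model the loss comes from the downstream task and the gradient reaches the embedding row through the rest of the network. `squared_error` and `sgd_update` are hypothetical helper names.

```python
def squared_error(vec, target):
    """Toy loss: sum of squared differences between two vectors."""
    return sum((v - t) ** 2 for v, t in zip(vec, target))

def sgd_update(table, token_id, target, lr=0.1):
    """One gradient-descent step on a single row of the embedding table."""
    row = table[token_id]
    # The derivative of (row_k - target_k)^2 with respect to row_k is 2 * (row_k - target_k).
    grad = [2 * (v - t) for v, t in zip(row, target)]
    table[token_id] = [v - lr * g for v, g in zip(row, grad)]

table = {1028: [0.0, 0.0, 0.0, 0.0, 0.0]}   # untrained row for "refrigerator"
target = [3.14, 2.5, -0.2, 0.1, 1.2]        # pretend these values minimize the loss

before = squared_error(table[1028], target)
for _ in range(50):
    sgd_update(table, 1028, target)
after = squared_error(table[1028], target)
print(before > after)   # True: the row has moved toward the loss-minimizing values
```

Only the row for the word that actually appeared is updated, which is why words that never show up in training keep their initial values.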
