How does trax word embedding layer work?

We learned about the Continuous Bag of Words (CBOW) model in the previous course of the specialization. I am interested to know which method/algorithm Trax uses behind the scenes to create word embeddings.


Hi @sandeep_dhankhar

Trax does not use any particular model (like CBOW) to create word embeddings. The Trax Embedding layer maps (assigns) a vector of values (initially random) to each token number (e.g. a word token) and updates those values through gradient descent, driven by the loss function.

You could find this post useful.

When you create a Trax Embedding layer, you specify the number of tokens (the number of rows, usually the vocabulary size plus some special tokens) and the embedding dimension (the number of columns, i.e. how many values each vector has). Initially these values are random. Then, as you train your model, they are updated according to the loss function you specified: depending on whether your model guessed right or wrong, the values are raised or lowered with varying magnitude. After training you get the final weight values for the Embedding layer (and for the other layers too).
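The update mechanism described above can be sketched in a few lines of JAX. This is not Trax's actual training loop, just a minimal illustration: only the row of the embedding matrix that was looked up receives a nonzero gradient, so only that token's vector moves during the gradient-descent step. The vocabulary size, dimension, token id, and toy loss are all made up for the example.

```python
import jax
import jax.numpy as jnp

# Toy setup: vocabulary of 10 tokens, embedding dimension 3.
key = jax.random.PRNGKey(0)
emb = jax.random.normal(key, (10, 3))  # initially random, like Trax's Embedding weights

# A toy loss: push the embedding of one token toward a target vector.
target = jnp.ones(3)

def loss_fn(weights, token):
    vec = weights[token]               # look up that token's row
    return jnp.sum((vec - target) ** 2)

# One gradient-descent step for token 5. The gradient is zero everywhere
# except at row 5, so only that token's vector is updated.
grads = jax.grad(loss_fn)(emb, 5)
emb_new = emb - 0.1 * grads

print(jnp.allclose(emb_new[0], emb[0]))  # True: row 0 untouched
print(jnp.allclose(emb_new[5], emb[5]))  # False: row 5 was updated
```

In a real model the loss comes from the downstream task (e.g. sentiment classification), but the principle is the same: the embedding matrix is just another weight matrix updated by backpropagation.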

Thank you @arvyzukai. I get the intuition for what is happening in the embedding layer.

Since the embedding layer code is not part of the programming exercise, I am wondering how it actually happens.

Is there a pictorial representation of what is going on in the embedding layer? How does it coordinate with the input (tweets) and the later parts of the neural network? I am having difficulty figuring out the dimensions of the matrices in the embedding layer.

The Embedding code is very simple - Embedding code

As you can see, the forward propagation is just one line of code:

jnp.take(self.weights, x, axis=0)

What it does is simply “take” the x-th rows from the weight matrix (self.weights). So if you have an Embedding matrix for a vocabulary of 20 tokens with embedding dimension 4 (shape (20, 4)):

and you pass your batch of two sentences (for example, x of shape (2, 4)):

the Embedding layer will return shape (2, 4, 4); it adds one dimension, the embedding-size dimension:

This is all it does: it takes input of shape (batch_size, seqlen) and outputs shape (batch_size, seqlen, emb_size).
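The shape bookkeeping above can be checked directly with the same `jnp.take` call the layer uses. The weight values and token ids here are made up for illustration:

```python
import jax.numpy as jnp

# Embedding matrix: vocabulary of 20 tokens, embedding dimension 4.
weights = jnp.arange(20 * 4, dtype=jnp.float32).reshape(20, 4)

# A batch of 2 "sentences", each 4 tokens long: shape (batch_size, seqlen) = (2, 4).
x = jnp.array([[1, 5, 5, 0],
               [2, 7, 0, 0]])

# Forward pass of the Embedding layer: take the x-th rows of the weight matrix.
out = jnp.take(weights, x, axis=0)

print(out.shape)  # (2, 4, 4), i.e. (batch_size, seqlen, emb_size)

# Each token id is replaced by its row of the weight matrix,
# e.g. token 5 at position [0, 1] becomes weights[5]:
print(jnp.allclose(out[0, 1], weights[5]))  # True
```

Note that the same token id (5 appears twice in the first sentence) always maps to the same vector, since both lookups read the same row.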


Understood. Thank you for the clear explanation, @arvyzukai.

In the assignment, the vocab size is 9088 and the embedding size is 256. Thus the embedding matrix has 9088 × 256 = 2,326,528 entries. A total of 5000 positive and 5000 negative examples were used for training.

Are so few examples (compared to the number of parameters to be trained) enough to get good accuracy during training (we got ~100% accuracy in the assignment!)?

Also, since the embedding layer was trained on the very specific task of sentiment analysis, does it mean that these embeddings cannot be used in other use cases? If so, is there a way to train more general embeddings, like CBOW, that preserve semantic meaning?

Basically, wouldn’t it have been better to use CBOW embeddings here? Not only would the number of trainable parameters have been reduced drastically, the model would also have built on semantic meanings, which is how we humans process text too…