How does the Trax word embedding layer work?

We learned about the Continuous Bag of Words (CBOW) model in the previous course of the specialization. I am interested to know which method/algorithm Trax uses behind the scenes to create word embeddings.


Hi @sandeep_dhankhar

Trax does not use any particular model (like CBOW) to create word embeddings. The Trax Embedding layer maps (assigns) a vector of values (initially random) to each number (e.g. a word token) and updates those values through gradient descent in line with the loss function.

You might find this post useful.
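
To make that concrete, here is a tiny sketch (plain JAX, not the actual Trax internals) of an embedding table being updated by gradient descent - the sizes and the loss are made up purely for illustration:

```python
import jax
import jax.numpy as jnp

# Toy lookup table: 5 tokens, 3-dimensional vectors, initialized randomly.
key = jax.random.PRNGKey(0)
weights = jax.random.normal(key, (5, 3))

tokens = jnp.array([1, 3])                    # token ids that appear in one batch

def toy_loss(weights):
    vecs = jnp.take(weights, tokens, axis=0)  # look up the vectors for those tokens
    return jnp.sum(vecs ** 2)                 # stand-in for a real model's loss

grads = jax.grad(toy_loss)(weights)
weights = weights - 0.1 * grads               # gradient step: only rows 1 and 3 change
```

Note that only the rows of tokens that actually appear in the batch receive a non-zero gradient, so only those vectors move on that step.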

When you create a Trax Embedding layer, you specify the number of tokens (the number of rows: how many tokens you will use, usually the size of the vocabulary plus some special tokens) and the embedding dimension (the number of columns: how many values each vector should have). Initially these values are random. Then, when you train your model, they are updated according to the loss function you specified: depending on whether your model guessed right or wrong, the values are lowered or increased with varying magnitude. After training you get your final weight values for the Embedding layer (and also for the other layers).
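
For reference, creating and applying such a layer looks roughly like this (toy sizes; this is a sketch of the usual Trax pattern, not the assignment code):

```python
import numpy as np
from trax import layers as tl, shapes

vocab_size = 20   # rows: tokens in the vocabulary (plus special tokens)
d_feature = 4     # columns: the embedding dimension

embed = tl.Embedding(vocab_size=vocab_size, d_feature=d_feature)

tokens = np.array([[3, 7, 0, 12],
                   [5, 5, 19, 1]])         # a batch of two 4-token "sentences"

embed.init(shapes.signature(tokens))       # creates the (20, 4) weight matrix
vectors = embed(tokens)

print(embed.weights.shape, vectors.shape)  # (20, 4) (2, 4, 4)
```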

Thank you @arvyzukai. I get the intuition of what is happening in the embedding layer.

Since the embedding layer code is not part of the programming exercise, I am wondering how it actually happens.

Is there a pictorial representation of what is going on in the embedding layer? How does it coordinate with the input (tweets) and the later parts of the neural network? I am having difficulty figuring out the dimensions of the matrices in the embedding layer.

The Embedding code is very simple - Embedding code

As you can see it is just one line of code for forward propagation:

jnp.take(self.weights, x, axis=0)

What it does is simply “take” the x-th rows from the weight matrix (self.weights). So if you have an Embedding matrix with a vocabulary of length 20 and an embedding dimension of 4 (shape (20, 4)):
[image: the (20, 4) embedding weight matrix]

and you pass your batch of two sentences (for example, x of shape (2, 4)):
[image: the batch of token ids, shape (2, 4)]

the Embedding layer will return an output of shape (2, 4, 4), i.e. it adds one dimension - the embedding size dimension:
[image: the output tensor of shape (2, 4, 4)]

This is all that it does - it takes an input of shape *(batch_size, seqlen)* and outputs a tensor of shape *(batch_size, seqlen, emb_size)*.
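
If it helps, here is the same lookup reproduced with plain jax.numpy, using the toy sizes from the example above (the actual numbers in the weight matrix don't matter, only the shapes):

```python
import jax
import jax.numpy as jnp

# A (20, 4) weight matrix standing in for self.weights: vocab of 20, embedding size 4.
weights = jax.random.normal(jax.random.PRNGKey(0), (20, 4))

# A batch of two sentences, 4 tokens each - shape (2, 4).
x = jnp.array([[3, 7, 0, 12],
               [5, 5, 19, 1]])

out = jnp.take(weights, x, axis=0)   # picks rows 3, 7, 0, 12, ... for each position
print(out.shape)                     # (2, 4, 4): (batch_size, seqlen, emb_size)
```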


Understood, thank you for the clear explanation @arvyzukai.

In the assignment, the vocab size is 9088 and the embedding size is 256. Thus the embedding matrix has 9088 x 256 = 2,326,528 parameters. In total, 5000 positive and 5000 negative examples were used for training.

Are so few examples (compared to the number of parameters to be trained) enough to get good accuracy during training (we got ~100% accuracy in the assignment!)?

Also, since the embedding layer was trained on the very specific task of sentiment analysis, does that mean these embeddings cannot be used in other use cases? If so, is there a way to train more general embeddings, just like CBOW, which preserve semantic meaning in the embeddings?

Basically, wouldn’t it have been better to use CBOW embeddings here? Not only would the number of trainable parameters have been reduced drastically, the model would also have built on semantic meanings, which is how we humans process text as well…