Number of LSTM units in Trax

It has been mentioned a few times throughout the labs and the assignments that the number of LSTM units in Trax is designed to be exactly equal to the dimensionality of the embedding that is fed into the LSTM layer. I would appreciate it if someone could provide the intuition behind why such a constraint is necessary. Thank you in advance.


Hi @Davit_Khachatryan

That is a good question, and I don’t have a good answer as to why that is the case (maybe others will elaborate on it). But in principle it does not have to be this way. I’ve implemented models in other DL frameworks where RNNs work with a different dimensionality than the Embedding layer.

Yes - exactly this question. I’ve seen elsewhere that it does not have to be the length of the Embedding, and I’d love an explanation of the pros and cons (or at least the reasoning) for doing it this way.

Initially, I imagined that the number of units in the LSTM would be the length of the longest sentence, since there would be max_len inputs to the LSTM.


First thing to note - LSTM units do not depend on sentence length, and that should be fully understood before talking about Embedding and LSTM dimensions. An LSTM processes one token embedding at a time, so its dimensions do not (directly) depend on the sequence length. Check my simple example of how an LSTM processes a sequence after embedding.

On the correlation between Embedding and LSTM dimensionality - it is not uncommon to have equal Embedding and LSTM dimension sizes, but it is not a must. I think it is the trax library’s design choice not to implement this flexibility, perhaps because of a lack of demand from the ML community or something else… I don’t know, and nobody has commented on it since.

What always matters most is the model’s performance, and that depends on many things (especially data quality). Some of the things you can do (but not limited to) are:

  • You can increase or decrease the vocabulary size (by dropping some words in favour of “UNK”, etc.), but that should not directly influence your Embedding size. Usually your vocab size will be 10,000 - 100,000.
  • You can increase or decrease the Embedding size - compressing each word’s “expressiveness” (lexis → meaning) into a certain number of dimensions (each word’s meaning reduced to a vector of that length). The quality of word embeddings increases with higher dimensionality, but at some point the gains diminish, so you need trial and error to find the right size for you. Your Embedding size will also depend on the tokens used (words vs. subwords vs. characters), but usually it is 256 - 1,024.
  • You can increase or decrease the LSTM size - compressing or expanding the model’s ability to carry information from step to step (word to word in a sentence). (Note also that the number of layers is a factor in how hierarchically the information propagates, and directionality is a factor too.) This size should mostly depend on what comes after this layer: is it the encoder, is it used for categorical predictions (is there a Dense layer after it, and how big?), or something else.

Usually it will depend on the problem you are trying to solve (the payoff) and on costs (hardware / time / electricity) - smaller models tend to be cheaper and faster, but also tend to be less accurate.
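To make the first point concrete, here is a minimal NumPy sketch (with made-up sizes, not the assignment’s) showing that the vocabulary size and the embedding size are independent knobs: the embedding is just a lookup table of shape (vocab_size, emb_dim).

```python
import numpy as np

# Hypothetical sizes for illustration only.
vocab_size = 10_000   # number of tokens kept in the vocabulary
emb_dim = 256         # embedding dimensionality, chosen independently

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, emb_dim))

# A sentence is a sequence of token ids; embedding is a row lookup.
token_ids = np.array([42, 7, 999])
embedded = embedding_matrix[token_ids]

print(embedded.shape)  # (3, 256): seq_len x emb_dim
```

Changing vocab_size only changes the number of rows in the table; each looked-up vector still has emb_dim entries.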

Thank you @arvyzukai. That is very helpful.

With respect to your example: the shape of the embedded sentence is (8, 5), where 8 = number of words in the sentence including PADs and 5 = length of the embedding.

In your example, the 8 embedded words EACH go through the LSTM (one token at a time), so shouldn’t the number of LSTM units be 8, whereas the number of units based on embedding size would be 5? If it is 5, then how does each of the 8 words get to go through the LSTM to produce an output based on the previous word?

No - as I said, it is very important to understand that sequence length does not influence RNN (including LSTM) dimensions, at least not directly. In other words, it doesn’t matter whether the longest sentence in your dataset is 1,000 words or 30.

It is easier to explain with a simple RNN, because an LSTM has more weight matrices and activations, plus one more hidden state, which makes the explanation more complicated. A simple RNN works in a similar way to an LSTM and illustrates the main idea better.

Note that here, for the sake of illustration (to go from top to bottom and to save space), the dimensions of the vectors and matrices are different than they would be in the real world. Leaving that aside, the most important thing to see is that all the weights are the same at every step (word). (The RNN weights are in the yellow frame.)
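The shared-weights point can be sketched in NumPy (toy dimensions, emb_dim=5 and n_units=4, not the real assignment’s): the same two weight matrices process a 3-word sentence and a 30-word sentence, and only the number of loop iterations differs.

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(5, 4))   # input-to-hidden weights
W_hh = rng.normal(size=(4, 4))   # hidden-to-hidden weights

def run_rnn(sequence):
    """Process a (seq_len, 5) sequence one token embedding at a time."""
    h = np.zeros((1, 4))
    for x in sequence:
        h = np.tanh(x.reshape(1, 5) @ W_xh + h @ W_hh)
    return h

short_sent = rng.normal(size=(3, 5))    # 3-token sentence
long_sent = rng.normal(size=(30, 5))    # 30-token sentence

# Both run through the exact same weights; the hidden state is (1, 4) either way.
assert run_rnn(short_sent).shape == run_rnn(long_sent).shape == (1, 4)
```

The weight shapes depend only on the embedding size (5) and the hidden size (4), never on the sentence length.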

For example, I highlighted the second step:

  1. after embedding, the input x_2 has shape (1, 5); it is multiplied by W_{xh} (shape (5, 4)), which produces shape (1, 4);
  2. the previous hidden state h_1 has shape (1, 4); it is multiplied by W_{hh} (shape (4, 4)), which produces shape (1, 4);
  3. both outputs are summed, which does not change the shape (1, 4);
  4. tanh is applied to the sum, which also does not change the shape (1, 4).

The result is h_2 - the hidden state for the next step.
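The four numbered steps can be written out directly in NumPy (same toy shapes as above; the actual values are random placeholders, not the ones in the picture):

```python
import numpy as np

rng = np.random.default_rng(0)
x_2 = rng.normal(size=(1, 5))     # embedded input token, shape (1, 5)
h_prev = np.zeros((1, 4))         # previous hidden state, shape (1, 4)
W_xh = rng.normal(size=(5, 4))
W_hh = rng.normal(size=(4, 4))

a = x_2 @ W_xh       # step 1: (1, 5) @ (5, 4) -> (1, 4)
b = h_prev @ W_hh    # step 2: (1, 4) @ (4, 4) -> (1, 4)
s = a + b            # step 3: elementwise sum, still (1, 4)
h_2 = np.tanh(s)     # step 4: tanh, still (1, 4)

print(h_2.shape)  # (1, 4)
```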

For the sake of completeness, I can share my own calculations checking the inner workings of this week’s C3_W3 assignment. Maybe someone will find them useful.

An example of the calculations:


You can compare the different values for the word “of” at step t=1 and step t=17. Note that the inputs (the embeddings) are the same, but because of the different hidden states c_16 and h_16, the outputs differ.

The example output of LSTM for the first sentence:

The output of the model:


@arvyzukai - First of all, thank you very much. These are great examples and are very helpful.

One (hopefully) last question. I think my confusion might stem from the use of the terms “units” vs. “steps”. I understand now that the number of units in an LSTM does not depend on the length of the sentence. However, in your example with the blue boxes and the yellow boxes (for the weights), each of those blue boxes is a “step”.

When configuring the LSTM layer, we pass in the “n_units” number. Does Trax implicitly know that it needs to run one step for each word in the sentence? I assumed I needed to specify that somewhere (similar to how in a CNN we need to specify how many convolutional layers are needed), but maybe that’s not correct.

thank you again for the thorough examples

For me personally, the term “n_units” is a poor choice and needlessly causes confusion. If it were named “n_dim” I think it would be clearer, but I guess whoever named it had their reasons…

To get back to your question:
The “n_units” in the “Refrigerator” example is 4 - the output of the RNN is a vector of size 4 for each step. In the LSTM example, “n_units” is 40 - each step produces a vector of size 40.

Loosely speaking, the whole LSTM layer produces an (n_steps, n_units) matrix, as in the picture “The example output of LSTM for the first sentence:”.

If you have more layers, then the first layer’s output is the next layer’s input, and “n_units” is the same number for all layers. The output dimensionality does not depend on the number of layers. Usually training is done in batches, so the output would be (batch_size, n_steps, n_units).
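The output shape described above can be sketched with a toy NumPy RNN over a batch (all sizes are illustrative: batch_size=2, n_steps=8, emb_dim=5, n_units=4):

```python
import numpy as np

batch_size, n_steps, emb_dim, n_units = 2, 8, 5, 4

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(emb_dim, n_units))
W_hh = rng.normal(size=(n_units, n_units))

inputs = rng.normal(size=(batch_size, n_steps, emb_dim))
h = np.zeros((batch_size, n_units))
outputs = []
for t in range(n_steps):
    # One n_units-sized vector per step, for every sentence in the batch.
    h = np.tanh(inputs[:, t, :] @ W_xh + h @ W_hh)
    outputs.append(h)
outputs = np.stack(outputs, axis=1)

print(outputs.shape)  # (2, 8, 4): (batch_size, n_steps, n_units)
```

The number of steps comes from the input’s length, not from any layer parameter, which is why only n_units needs to be configured.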

We pass the “n_units” number to trax so that it knows what the dimensions of the weight matrices should be.

Thank you! This was all incredibly helpful

@arvyzukai - thanks for taking the time to address my question. And thanks to @pneff for developing the discussion.

I agree with the original question. The topic got a bit side-tracked with explanations of other aspects of the parameters, but the original question remains unanswered.

It is quite unclear why Trax has chosen to limit the output dimension in this way. There seems to be no need for it, yet the choice of n_units is quite clearly checked and hard-coded. Is there anyone who can explain why this is?

Hi @Laurenz_Eveleens

Answering the original question is not easy, because you would have to ask the trax creators themselves. But if you want to change the dimensions, you can easily add a Dense layer, and that would solve the problem.
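The Dense-layer workaround can be sketched in plain NumPy (all sizes here are hypothetical): a fully connected projection between the embedding and the LSTM maps each token embedding to whatever input size the LSTM expects.

```python
import numpy as np

emb_dim, lstm_dim = 300, 128   # made-up sizes for illustration

rng = np.random.default_rng(0)
W_dense = rng.normal(size=(emb_dim, lstm_dim))
b_dense = np.zeros(lstm_dim)

embedded_sentence = rng.normal(size=(8, emb_dim))   # 8 tokens, 300-dim each
projected = embedded_sentence @ W_dense + b_dense   # each token now 128-dim

print(projected.shape)  # (8, 128): ready for an LSTM with n_units=128
```

In Trax itself this would look something like `tl.Serial(tl.Embedding(vocab_size, 300), tl.Dense(128), tl.LSTM(128))` - the exact sizes are, again, just an example.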