It has been mentioned a few times throughout the labs and the assignments that the number of LSTM units in Trax is designed to be exactly equal to the dimensionality of the embedding that is fed into the LSTM layer. I would appreciate it if someone could provide the intuition behind why such a constraint is necessary. Thank you in advance.
That is a good question and I don't have a good answer as to why it is the case (maybe others will elaborate on that). But in principle it does not have to be this way. I've implemented models in other DL frameworks where RNNs work with a different dimensionality than the Embedding layer.
Yes, exactly this question. I've seen elsewhere that it does not have to be the length of the Embedding, and I'd love an explanation of the pros and cons (or at least the reasoning) for doing so.
Initially, I imagined that the number of units in the LSTM would be the length of the longest sentence, since there would be max_len inputs to the LSTM.
Thanks!
First thing to note: LSTM units do not depend on sentence length. That should be fully understood before we can talk about Embedding and LSTM dimensions (they do not directly depend on sequence length; the LSTM processes one token embedding at a time). Check my simple example of how an LSTM processes a sequence after embedding.
On the correlation between Embedding dimensionality and LSTM dimensionality: it is not uncommon to have equal Embedding and LSTM dimension sizes, but it is not a must. I think it is the trax library's design choice not to implement this flexibility, perhaps because of a lack of demand from the ML community or something else… I don't know, and nobody has commented on that since.
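To make the "not a must" point concrete, here is a minimal sketch in plain numpy (made-up sizes; a vanilla RNN cell stands in for an LSTM). The embedding dimension (5) and the recurrent dimension (7) differ, and the same weights handle sequences of any length:

```python
import numpy as np

d_emb, d_hid = 5, 7  # embedding and recurrent sizes differ: no constraint here
rng = np.random.default_rng(42)
W_xh = rng.normal(size=(d_emb, d_hid))   # maps token embedding -> hidden
W_hh = rng.normal(size=(d_hid, d_hid))   # maps previous hidden -> hidden

def run(seq_len):
    """Run a vanilla RNN over a random sequence of seq_len token embeddings."""
    h = np.zeros(d_hid)
    for _ in range(seq_len):             # one step per token
        x = rng.normal(size=d_emb)       # stand-in for a token embedding
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h

# The same weights process a 3-token and a 30-token sentence alike.
print(run(3).shape, run(30).shape)   # (7,) (7,)
```

Nothing in the recurrence forces d_hid to equal d_emb; the only requirement is that W_xh has shape (d_emb, d_hid).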
What is always most important is the model's performance, and that depends on many things (especially data quality). Some of the things you can do (but not limited to) are:
- You can increase or decrease the vocabulary size (by dropping some words in favour of "UNK", etc.), but it should not directly influence your Embedding size. Usually your vocab size will be 10,000–100,000.
- You can increase or decrease the Embedding size, compressing each word's "expressibility" (lexis → meaning) into a certain number of values (each word's meaning reduced to vector-length dimensions). The quality of a word embedding increases with higher dimensionality, but at some point the gains diminish; you need trial and error to find the right size. Your Embedding size will also depend on the tokens (words vs. subwords vs. characters), but usually it is 256–1,024.
- You can increase or decrease the LSTM size, further compressing or expanding the model's ability to carry information from step to step (word to word in a sentence). (Note also that the number of layers affects how information propagates hierarchically, and that directionality is a factor too.) This size should mostly depend on what comes after this layer (is it the encoder, or is it for categorical predictions (is there a Dense layer after it, and how big?), or something else?).
Usually it will depend on the problem you are trying to solve (profit) and on costs (hardware / time / electricity): smaller models tend to be cheaper and faster but tend to lack accuracy.
Thank you @arvyzukai. That is very helpful.
With respect to your example: the shape of the embedded sentence is (8, 5), where 8 = the number of words in the sentence (including PADs) and 5 = the embedding length.
In your example, the 8 embedded words (one token at a time) EACH go through the LSTM, so shouldn't the number of LSTM units be 8? Whereas the number of units based on embedding size would be 5? If it is 5, then how does each of the 8 words get to go through the LSTM and produce an output based on the previous word?
No, as I said, it is very important to understand that sequence length does not influence RNN (including LSTM) dimensions (at least directly). In other words, it doesn't matter if the longest sentence in your dataset is 1,000 words or 30.
It is easier to explain with a simple RNN, because an LSTM has more weight matrices and activations, as well as one more hidden state, which makes the explanation more complicated. A simple RNN illustrates the main idea better and works in a similar way to an LSTM.
Note that here, for the sake of illustration (to go from top to bottom and to save space), the dimensions of the vectors and matrices are different than they would be in the real world. Leaving that aside, the most important thing to see is that all the weights are the same for each step (word). (The RNN weights are in the yellow frame.)
For example, I highlighted the second step:
- after embedding, the input x_2 has shape (1, 5) and is multiplied by W_{xh} (shape (5, 4)), which produces shape (1, 4);
- the previous hidden state h_1 has shape (1, 4) and is multiplied by W_{hh} (shape (4, 4)), which produces shape (1, 4);
- both outputs are summed, which does not change the shape (1, 4);
- tanh is applied to the sum, which also does not change the shape (1, 4).
The result is h_2, the hidden state for the next step.
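The highlighted step can be checked in a few lines of numpy (random values; the shapes are the ones from the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x_2 = rng.normal(size=(1, 5))      # embedded second word
h_1 = rng.normal(size=(1, 4))      # hidden state from the previous step
W_xh = rng.normal(size=(5, 4))     # input-to-hidden weights
W_hh = rng.normal(size=(4, 4))     # hidden-to-hidden weights

# Multiply, sum, squash: the whole RNN step in one line.
h_2 = np.tanh(x_2 @ W_xh + h_1 @ W_hh)
print(h_2.shape)  # (1, 4)
```

Note that the output shape (1, 4) is fixed by the weight matrices, not by how many words the sentence has.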
For the sake of completeness, I can share my own calculations to check the inner workings of this week's C3_W3 assignment. Maybe someone will find it useful.

The example of a batch:

The example of the Embedding weights:

The first sentence embedded example:
Note that the same words have the same embeddings (highlighted in blue and orange). 
The example of LSTM input weights for first layer W_ih_l0:

The example of LSTM hidden state weights for the first layer W_hh_l0:

The example of LSTM biases (for both input and hidden state):
The example of calculations:

t = 0 ("Thousands")

t = 1 ("of")

t = 2 ("demonstrators")

t = 17 (note the jump ahead: this is step 18, the word "of")
Note:
You can compare the values for the word "of" at step t=1 and at step t=17. Note that the inputs (the embeddings) are the same, but because of the different hidden states c_16 and h_16, the output is different.
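To illustrate the same effect without the assignment's real weights, here is a sketch of a single LSTM step in numpy (random weights; the packed gate layout follows the W_ih_l0 / W_hh_l0 convention used above, but the sizes are made up). Feeding the same input embedding with two different hidden/cell states gives different outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid = 5, 4  # illustrative sizes, not the assignment's real ones

# Packed weights: rows for input, forget, cell-candidate, and output gates.
W_ih = rng.normal(size=(4 * d_hid, d_emb))
W_hh = rng.normal(size=(4 * d_hid, d_hid))
b = rng.normal(size=(4 * d_hid,))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    """One LSTM step: returns the new hidden state h and cell state c."""
    z = W_ih @ x + W_hh @ h + b
    i, f, g, o = np.split(z, 4)              # the four gate pre-activations
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

x_of = rng.normal(size=d_emb)                # stand-in embedding for "of"
h1, c1 = np.zeros(d_hid), np.zeros(d_hid)    # state before t=1
h16, c16 = rng.normal(size=d_hid), rng.normal(size=d_hid)  # state before t=17

out_a, _ = lstm_step(x_of, h1, c1)           # "of" early in the sentence
out_b, _ = lstm_step(x_of, h16, c16)         # "of" later, different state
print(np.allclose(out_a, out_b))             # False: same input, different state
```

The same embedding goes in both times; only the carried state differs, and that alone changes the output.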
The example output of LSTM for the first sentence:
 The example of Linear (Dense) layer weights (W and b):
The output of the model:
@arvyzukai First of all, thank you very much. These are great examples and very helpful.
One (hopefully) last question. I think my confusion might stem from the use of the terms "units" vs. "steps". I understand now that the number of units in an LSTM does not depend on the length of the sentence. However, in your example with the blue boxes and the yellow boxes (for the weights), each of those blue boxes is a "step".
When configuring the LSTM layer, we pass in the "n_units" number. Does trax implicitly know that it needs one step for each word in the sentence? I assumed I needed to specify that somewhere (similar to how in a CNN we need to specify how many convolutional layers are needed), but maybe that's not correct.
Thank you again for the thorough examples.
For me personally, the term "n_units" is a poor choice and it needlessly causes confusion. If it were named "n_dim" I think it would be clearer, but I guess whoever named it had their reasons…
To get back to your question:
The "n_units" in the "Refrigerator" example is 4: the output from the RNN is a vector of size 4 at each step. In the LSTM example, "n_units" is 40: each step produces a vector of size 40.
Loosely speaking, the whole LSTM layer produces an (n_steps, n_units) matrix, like in the picture "The example output of LSTM for the first sentence".
If you have more layers, then the first layer's output is the next layer's input, and "n_units" is the same number for all layers. The output dimensionality does not depend on the number of layers. Usually training is done in batches, so the output would be (batch_size, n_steps, n_units).
We pass the "n_units" number to trax so that it knows what the dimensions of the weight matrices should be.
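As a sketch of those shapes (plain numpy, a simple RNN cell instead of an LSTM, illustrative sizes): the loop runs once per token, and stacking the per-step hidden states gives the (batch_size, n_steps, n_units) output:

```python
import numpy as np

batch_size, n_steps, d_emb, n_units = 2, 8, 5, 40  # illustrative sizes
rng = np.random.default_rng(1)
embedded = rng.normal(size=(batch_size, n_steps, d_emb))  # embedded batch

W_xh = rng.normal(size=(d_emb, n_units))
W_hh = rng.normal(size=(n_units, n_units))

h = np.zeros((batch_size, n_units))
outputs = []
for t in range(n_steps):                       # one step per token
    h = np.tanh(embedded[:, t, :] @ W_xh + h @ W_hh)
    outputs.append(h)
out = np.stack(outputs, axis=1)                # collect the per-step outputs
print(out.shape)  # (2, 8, 40): (batch_size, n_steps, n_units)
```

n_steps comes from the data (the sequence axis of the input); only n_units is fixed by the layer's weights, which is why it is the only number you pass in.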
Thank you! This was all incredibly helpful
@arvyzukai thanks for taking the time to address my question. And thanks to @pneff for developing the discussion.
I agree with the original question. The topic seemed to get a bit sidetracked into explaining other aspects of the parameters, but the original question is unanswered.
It is quite unclear why trax has chosen to limit the output dimension in this way. There seems to be no need for it, but it is quite clearly checked and hardcoded to restrict the choice of n_units. Can anyone explain why this is?
Answering the original question is not that easy, because you would have to ask the trax creators themselves. But if you want to change the dimensions, you can easily add a Dense layer, and that solves the problem.
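For instance, in numpy terms (illustrative sizes; in trax this would amount, if I remember the API correctly, to appending something like tl.Dense after the LSTM layer): a Dense layer is just an affine map applied at every step, so it can project the (n_steps, n_units) LSTM output to whatever dimension you need:

```python
import numpy as np

n_steps, n_units, d_out = 8, 40, 16     # illustrative sizes
rng = np.random.default_rng(7)
lstm_out = rng.normal(size=(n_steps, n_units))   # stand-in for LSTM output

# Dense layer: one weight matrix and bias, applied to every step at once.
W = rng.normal(size=(n_units, d_out))
b = rng.normal(size=(d_out,))
projected = lstm_out @ W + b
print(projected.shape)  # (8, 16)
```

So even with n_units tied to the embedding size, the dimensionality of what comes out of the stack is entirely up to you.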