Hello, I have trouble understanding this piece of code in the assignment. Specifically, I don’t quite get the meaning of the variables d_model and n_encoder_layers. Also, why use a for loop here?
feed the embeddings to the LSTM layers. It is a stack of n_encoder_layers LSTM layers
[tl.LSTM(d_model) for _ in range(n_encoder_layers)]
This piece of code comes from
(Solution code removed, as posting it publicly is against the honour code of this community, regardless if it is correct or not)
I can offer an explanation in simple words: d_model is how many numbers each vector has (how many numbers are used to represent each word/state). So in your pictures, the arrows would carry a vector of 4 numbers in your example (like [0.1, -0.3, 0.5, 0.8]).
As for the second layer, the bottom picture visualizes what happens: the bottom layer passes its (modified) vector to the upper layer (the upper layer does not receive the embedding layer’s output, but the output of the layer below).
Yes, this is a common misunderstanding: the number of words in the sentence has nothing to do with d_model or n_encoder_layers.
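If it helps, here is a minimal numpy sketch (the numbers are made up) of what d_model means for a 4-word sentence; it is just a coincidence that the sentence length and d_model are both 4 here:

```python
import numpy as np

d_model = 4  # each word/state is represented by a vector of 4 numbers

# Made-up embeddings for the 4 words of "It's time for tea":
embeddings = np.array([
    [0.1, -0.3, 0.5, 0.8],   # "It's"
    [0.2,  0.1, -0.4, 0.6],  # "time"
    [-0.5, 0.3, 0.2, 0.1],   # "for"
    [0.7, -0.2, 0.0, 0.4],   # "tea"
])

print(embeddings.shape)  # (4, 4): 4 words, each one a d_model=4 vector
# The first 4 is the number of words, the second 4 is d_model;
# they just happen to be equal in this toy example.
```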
The for loop just builds the list. For example, in this case it could have been [tl.LSTM(d_model), tl.LSTM(d_model)] and that would have been equivalent, but for the general case (any number of layers) you use the list comprehension as in the exercise.
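As a small sketch (assuming trax is installed, and using example values d_model = 4 and n_encoder_layers = 2), the two ways of writing the list are interchangeable:

```python
from trax import layers as tl

d_model = 4            # size of each hidden/state vector (example value)
n_encoder_layers = 2   # number of stacked LSTM layers (example value)

# The list comprehension from the exercise...
stack_a = [tl.LSTM(d_model) for _ in range(n_encoder_layers)]

# ...builds the same list as writing the layers out by hand:
stack_b = [tl.LSTM(d_model), tl.LSTM(d_model)]

# Either list can then be unpacked into a Serial combinator:
encoder_layers = tl.Serial(*stack_a)
```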
In code, it is just like in your example. As for why we stack layers: generally for better performance (when the dataset is complex enough, more layers tend to better capture the complexity).
If you have further questions feel free to ask.
Cheers
Thank you so much for your help. I think I get it.
So [tl.LSTM(d_model) for _ in range(n_encoder_layers)]: this code takes the embeddings for each word (“It’s time for tea”, where each word is represented by a vector of four numbers as your example shows), then sends them to an LSTM, then another LSTM, then another… (there are n_encoder_layers LSTM layers stacked together).
Technically speaking, embeddings are what come out of the embedding layer, and the LSTM outputs should be called representations… (but I’m not a big fan of terminology debates, because different people use different words for the same thing), so your picture should look more like:
“It’s” ----tokenized-----> [54] ----embedded----> [0.1, 0.3, 0.5, 0.7] (and this vector would be called the embedding).
Now the first LSTM (and the second) are missing arrows on their left sides (each rectangle also receives the previous “hidden state”); for the first step, for example, that would be [0, 0, 0, 0].
The top rectangles would be the representations of each word. (I cannot draw you a diagram right now, but your second picture in the previous post is a good representation of what happens (we don’t use the softmax in the encoder).)
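To tie the picture together, here is a rough sketch (not the assignment solution; the vocabulary size and token ids are made up) of an embedding layer followed by two stacked LSTMs, showing that the output is one d_model-sized representation per word:

```python
import numpy as np
from trax import layers as tl
from trax import shapes

vocab_size = 100   # made-up tiny vocabulary
d_model = 4        # each word becomes a vector of 4 numbers

# Embedding turns token ids into d_model-dimensional vectors,
# then two stacked LSTMs refine them into representations.
encoder = tl.Serial(
    tl.Embedding(vocab_size=vocab_size, d_feature=d_model),
    tl.LSTM(d_model),
    tl.LSTM(d_model),
)

tokens = np.array([[54, 12, 7, 31]])    # "It's time for tea" as made-up ids
encoder.init(shapes.signature(tokens))  # initialize weights for this input shape
representations = encoder(tokens)
print(representations.shape)            # (1, 4, 4): batch of 1, 4 words, d_model=4 each
```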
You might find this thread (with actual numeric values) helpful.
I have some follow-up questions after reading this post.
“Now the first LSTM (and the second) are missing arrows on their left sides (each rectangle also receives the previous ‘hidden state’); for the first step, for example, that would be [0, 0, 0, 0].”
Why is the left-side input for the first layer [0, 0, 0, 0]? What is the left-side input for the 2nd layer? Is it also [0, 0, 0, 0]?
What is the output dimension for one of the rectangles in the last image above? My understanding is that this dimension is specified by the d_model param, correct?
Does the output dimension for each rectangle need to be aligned with the embedding dimension? My understanding is no, but for parallel computing we still want the number to be a power of 2 (2^n). Is that correct?
Does the embedding need to be specified as a param for building an LSTM model?
Not necessarily, but it is often the case (in the end, what matters is performance, and usually after playing with the hyperparameters the embedding dimension and d_model end up the same).
I’m not sure I understand your question correctly:
the embedding layer is not mandatory: you can implement an LSTM model without it.
if you have an embedding layer, you need to specify how “deep” the embedding is, i.e. the embedding dimension: how many numbers every token uses to represent its meaning (or how many dimensions there are in the space representing every token; most commonly for word embeddings this is 50 to 300, meaning each word’s meaning can be represented with, say, 300 numbers).
In either case, this is not specific to LSTM models; it applies in general.
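A tiny sketch of the two cases (the vocabulary size and dimensions are arbitrary, just for illustration):

```python
from trax import layers as tl

# With an embedding layer you must choose the embedding dimension
# (here 300: every token is represented by 300 numbers).
with_embedding = tl.Serial(
    tl.Embedding(vocab_size=10000, d_feature=300),
    tl.LSTM(300),
)

# Without an embedding layer the LSTM consumes feature vectors you
# already have (e.g. precomputed embeddings or other numeric features).
without_embedding = tl.LSTM(300)
```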