LSTM and encoder_layers

Hello, I’m having trouble understanding this piece of code in the assignment. Specifically, I don’t quite understand what the variables d_model and n_encoder_layers mean. Also, why is a for loop used here?

feed the embeddings to the LSTM layers. It is a stack of n_encoder_layers LSTM layers

    [tl.LSTM(d_model) for _ in range(n_encoder_layers)]

this piece of code comes from

(Solution code removed, as posting it publicly is against the honour code of this community, regardless if it is correct or not)

If I take the example from the lecture:

Say each of the input words in “It’s time for tea.” is d_model-dimensional. Would n_encoder_layers here be 4, since there are four words?

Then why does the hint say “It is a stack of n_encoder_layers LSTM layers”? I guess I have some misunderstanding here.

Also, how do we stack LSTMs? I searched the internet a little; is it something like the following?

Hi @Fei_Li

I can offer an explanation in simple words: d_model is how many numbers each vector has (how many numbers are enough to represent each word/state). So in your pictures, each arrow would carry a vector of 4 numbers (like [0.1, -0.3, 0.5, 0.8]).

As for the second layer, the bottom picture visualizes what happens: the bottom layer passes its (modified) vector to the upper layer. The upper layer does not receive the output of the embedding layer, only the output of the layer below it.

Yes, you have a common misunderstanding - the number of words in this case has nothing to do with d_model or n_encoder_layers.

The for loop is just there to populate the list. For example, with two layers it could have been written as
[tl.LSTM(d_model), tl.LSTM(d_model)] and that would have been equivalent, but for the general case you would use a list comprehension as in the exercise.
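To make the equivalence concrete, here is a minimal sketch that runs without Trax installed. `FakeLSTM` is a hypothetical stand-in for `tl.LSTM`, and the values of `d_model` and `n_encoder_layers` are picked only for illustration.

```python
# Hypothetical stand-in for tl.LSTM, used only so this snippet runs without Trax.
class FakeLSTM:
    def __init__(self, n_units):
        self.n_units = n_units


d_model = 4            # illustrative value
n_encoder_layers = 2   # illustrative value

# Written out by hand: two separate layer objects.
manual = [FakeLSTM(d_model), FakeLSTM(d_model)]

# The same list built with a comprehension. The loop variable is named `_`
# because its value is never used; the loop only repeats the construction.
stacked = [FakeLSTM(d_model) for _ in range(n_encoder_layers)]

print(len(manual), len(stacked))              # 2 2
print([layer.n_units for layer in stacked])   # [4, 4]
```

Each pass through the comprehension creates a new layer object, so the layers in the list do not share weights.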

In code, stacking looks just like in your example. As for why we stack: more layers tend to perform better when the dataset is complex enough, because they can better capture that complexity.
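As a rough illustration of stacking, the sketch below implements a tiny LSTM in NumPy and feeds the output of each layer into the next one. It is not the Trax implementation: the helper names (`init_lstm`, `run_lstm`, `encode`), the random weights, and the sizes are all assumptions made for this example. It also shows that the number of words only changes the sequence length, not d_model or n_encoder_layers.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(d_in, d_hidden, rng):
    # One weight matrix covering all four gates (input, forget, output, candidate).
    return {
        "W": rng.normal(0.0, 0.1, size=(d_in + d_hidden, 4 * d_hidden)),
        "b": np.zeros(4 * d_hidden),
    }


def run_lstm(params, xs):
    """Run one LSTM layer over a sequence xs of shape (seq_len, d_in)."""
    d_hidden = params["b"].shape[0] // 4
    h = np.zeros(d_hidden)  # initial hidden state: all zeros
    c = np.zeros(d_hidden)  # initial cell state: all zeros
    outputs = []
    for x in xs:
        z = np.concatenate([x, h]) @ params["W"] + params["b"]
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)  # shape (seq_len, d_hidden)


def encode(embeddings, layers):
    # Each layer reads the outputs of the layer below it, not the raw embeddings.
    h = embeddings
    for params in layers:
        h = run_lstm(params, h)
    return h


rng = np.random.default_rng(0)
d_model, n_encoder_layers = 4, 3
layers = [init_lstm(d_model, d_model, rng) for _ in range(n_encoder_layers)]

four_words = rng.normal(size=(4, d_model))   # e.g. a 4-token sentence
seven_words = rng.normal(size=(7, d_model))  # a longer sentence, same layers

print(encode(four_words, layers).shape)   # (4, 4)
print(encode(seven_words, layers).shape)  # (7, 4)
```

The same three layers handle both sentences; only the first dimension of the output follows the number of tokens.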

If you have further questions feel free to ask.

Thank you so much for your help. I think I get it.
So [tl.LSTM(d_model) for _ in range(n_encoder_layers)] takes the embedding of each word (in “It’s time for tea”, each word is represented by a vector of four numbers, as your example shows), sends it to an LSTM, then another LSTM, then another… (there are n_encoder_layers LSTM layers stacked together for each word).

I drew my thoughts in the picture below.

I think I got what you mean. Am I right?

Well, to be more precise:

Technically speaking, the embeddings are what you get right after the embedding layer, and what comes out of the LSTM layers should be called representations (though I’m not a big fan of terminology, because different people use different words for the same thing and the same word for different things). So your picture should look more like:

“It’s” ----tokenized-----> [54] ----embedded----> [0.1, 0.3, 0.5, 0.7] (and this vector would be called the embedding).
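The two steps in that arrow diagram can be sketched in a few lines of NumPy. This is only an illustration: apart from the id 54 for “It’s”, the token ids, the vocabulary size, and the random embedding table are made up.

```python
import numpy as np

# Hypothetical vocabulary; only the id 54 comes from the example above.
vocab = {"It's": 54, "time": 7, "for": 12, "tea": 31}

vocab_size, d_model = 100, 4  # illustrative sizes
rng = np.random.default_rng(0)

# An embedding layer is essentially a lookup table with one row per token id.
# Here the rows are random; in a real model they are learned.
embedding_table = rng.normal(size=(vocab_size, d_model))

# Tokenize: words -> integer ids.
tokens = [vocab[word] for word in ["It's", "time", "for", "tea"]]

# Embed: ids -> rows of the table, one d_model-sized vector per token.
embedded = embedding_table[tokens]

print(tokens)          # [54, 7, 12, 31]
print(embedded.shape)  # (4, 4)
```

The first 4 in the shape is the number of tokens and the second is d_model; they are equal here only by coincidence of the example.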

Now the first LSTM (and the second) are missing arrows from left sides (each rectangle receives previous “hidden states”), for the first state for example that would be [0, 0, 0, 0]

The top rectangles would be the representations of each word. I cannot draw you a diagram right now, but the second picture in your previous post is a pretty good representation of what happens, except that we don’t use the softmax in the encoder.

You might find this thread (with actual number values) helpful.

Sure, thank you so much. I now have a clearer picture of LSTMs and this line of code.

Hi, @arvyzukai

I have some follow-up questions after reading this post.

  1. “Now the first LSTM (and the second) are missing arrows from left sides (each rectangle receives previous “hidden states”), for the first state for example that would be [0, 0, 0, 0]”

Why is the left-side input for the first layer [0, 0, 0, 0]? What is the left-side input for the 2nd layer? Is it also [0, 0, 0, 0]?

  2. What is the output dimension of one of the rectangles in the last image above? My understanding is that this dimension is specified by the d_model param, correct?

  3. Does the output dimension of each rectangle need to match the embedding dimension? My understanding is no, but for parallel computing we still want the number to be 2^n. Is that correct?

  4. Does the embedding need to be specified as a param when building an LSTM model?

Hi @Peixi_Zhu

Yes, that would be the same for the 2nd layer.

You are correct.

Not necessarily, but that is often the case (in the end, what matters is performance, and usually after playing with the hyperparameters the embedding dimension and d_model end up the same).
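For a generic LSTM cell, the three answers above can be checked on a single time step. The sketch below is plain NumPy with made-up sizes and random weights. It is not the Trax layer, and whether `tl.LSTM` itself accepts an input size different from d_model is not something this example shows.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


d_emb, d_hidden = 6, 4  # deliberately different, illustrative sizes
rng = np.random.default_rng(0)

# Weights for the four gates; the input part has d_emb rows, the recurrent
# part has d_hidden rows, so the two sizes do not have to match.
W = rng.normal(0.0, 0.1, size=(d_emb + d_hidden, 4 * d_hidden))
b = np.zeros(4 * d_hidden)

x = rng.normal(size=d_emb)    # one embedded token
h_prev = np.zeros(d_hidden)   # no earlier step exists yet, so start from zeros
c_prev = np.zeros(d_hidden)

z = np.concatenate([x, h_prev]) @ W + b
i, f, o, g = np.split(z, 4)
c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
h = sigmoid(o) * np.tanh(c)

print(h.shape)  # (4,)
```

The output of the step has d_hidden numbers regardless of d_emb, which is the sense in which the output dimension is set by the layer size rather than by the embedding.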

I’m not sure I understand your question correctly:

  1. The embedding layer is not mandatory - you can implement an LSTM model without it.
  2. If you have an embedding layer, you need to specify how “deep” the embedding is: the embedding dimension, i.e. how many numbers every token uses to represent its meaning (or how many dimensions the space has in which every token is placed). For word embeddings this is most commonly 50 to 300, so with 300 each word’s meaning is represented by 300 numbers.
    In either case, that is not specific to LSTM models; it applies in general.