Bidirectional vs vanilla LSTM

Can you expand on the following comment in the assignment:
Pre-attention [LSTM]: Unlike in the encoder, in which you used a Bidirectional LSTM, here you will use a vanilla LSTM. Why? That is, why were we using a Bidirectional LSTM in the encoder and a vanilla one in the decoder? What dictates that choice?

Well, think about what the two layers are actually doing. The Encoder is analyzing the input and figuring out the relationships between all of its elements. Say the input is a sentence: the Encoder takes the word embeddings of all the words and builds representations that capture the relationships between all the words in the sentence. In that case it helps to see everything “at once” instead of just serially one step at a time, because later words may have a relationship with, and an effect on, the earlier words.

But then think about what the Decoder is going to do: it takes the understanding created by the Encoder and outputs a new sequence that expresses the output of your model. Say you’re building an English-to-French translator. Now you’ve got the encoded meaning of the English sentence as input and you want to output the French sentence. That you can do in a serial fashion, one word at a time.
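
Here’s a rough sketch of that encoder/decoder split (purely illustrative Keras code; the layer sizes and names are made up, and this is not the assignment’s model):

import tensorflow as tf

vocab_size, embed_dim, units = 10_000, 256, 128  # illustrative sizes

# Encoder: a Bidirectional LSTM reads the whole source sentence forwards and backwards.
encoder_inputs = tf.keras.Input(shape=(None,), dtype="int32")
enc_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(units, return_sequences=True)
)(enc_emb)                                   # shape: (batch, src_len, 2 * units)

# Decoder: a vanilla (unidirectional) LSTM, because at generation time only the
# tokens produced so far exist; there is no "future" to read backwards from.
decoder_inputs = tf.keras.Input(shape=(None,), dtype="int32")
dec_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(decoder_inputs)
decoder_outputs = tf.keras.layers.LSTM(2 * units, return_sequences=True)(dec_emb)  # width matches encoder output

# Cross-attention: every decoder step attends over all encoder positions (dot-product attention).
attended = tf.keras.layers.Attention()([decoder_outputs, encoder_outputs])
logits = tf.keras.layers.Dense(vocab_size)(attended)

model = tf.keras.Model([encoder_inputs, decoder_inputs], logits)
model.summary()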


@paulinpaloalto Thank you. That’s exactly the intuition I was looking for. In the same vein, under what circumstances (type of task, corpus, etc.) would I want to use a bidirectional layer for both the encoder and the decoder? When would I want a vanilla LSTM on the encoder side and a bidirectional one on the decoder side?
Or a GRU and a bi-GRU?
Also, is there a good reference for that?

Hi @Lenny_Tevlin

I realized the discussion was about Week 1 and the Bahdanau paper, where the concept of attention was first introduced.
Without the causal mask (the “look-ahead” mask used in the Attention Is All You Need paper), the decoder has to be unidirectional: during training with teacher forcing the target tokens are fed in as decoder inputs, so a bidirectional decoder could simply peek at the very tokens it is supposed to predict, and training would not work without that mask (or other modifications).
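
As an illustration (not the paper’s or the assignment’s code), a causal mask is just a lower-triangular matrix applied to the attention scores so that position i cannot see positions after i:

import numpy as np

def causal_mask(seq_len):
    # Lower-triangular matrix of ones: 1 = "allowed to attend", 0 = "masked out".
    return np.tril(np.ones((seq_len, seq_len), dtype=np.float32))

scores = np.random.randn(5, 5).astype(np.float32)   # raw attention scores for a length-5 sequence
mask = causal_mask(5)
masked_scores = np.where(mask == 1, scores, -1e9)    # future positions get a huge negative score
weights = np.exp(masked_scores) / np.exp(masked_scores).sum(axis=-1, keepdims=True)  # row-wise softmax
print(weights.round(2))  # upper triangle is 0: no attention to future tokens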

I would suggest the free online book Dive into Deep Learning, in particular the chapter on Modern Recurrent Neural Networks, for more information on this topic, and also its coverage of Bahdanau attention (and other attention mechanisms) related to it.

Cheers


Thank you so much for your explanation, @arvyzukai


@arvyzukai Not a conceptual but a technical question:
Why do we need line 19 in the generate_next_token function (section 4: using the model for inference),
logits = logits[:, -1, :] ?
We are doing a squeeze in lines 31 and 32 in any case.


Hi @Lenny_Tevlin

What this achieves is taking the model’s last output in the sequence (since the shape is (batch_size, sequence_length, probabilities)). In other words, we get the log probabilities of the predicted “next token”.

The squeeze removes the batch dimension, which in this case is 1, since we input only a single sentence. That’s the idea for the squeeze, I think.
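
To make that concrete, here is a small shape walk-through with made-up sizes (this is not the assignment’s exact generate_next_token code); it also shows why the slice is still needed even though we squeeze later:

import numpy as np

batch_size, seq_len, vocab_size = 1, 7, 12           # illustrative sizes
logits = np.random.randn(batch_size, seq_len, vocab_size).astype(np.float32)

print(np.squeeze(logits).shape)        # (7, 12): squeeze alone only drops the batch dim of 1,
                                       # it still leaves one row of scores per input position
last_step = logits[:, -1, :]           # (1, 12): keep only the scores for the *next* token
next_token = int(np.argmax(np.squeeze(last_step)))   # drop the batch dim, greedy pick
print(last_step.shape, next_token)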

But thanks to your question, I believe I spotted an issue with the next exercise, the Exercise 5 translate() function, where generate_next_token is fed a single token every time (next_token is always a single token).
I will report the issue.
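
For context, one common decoding pattern when the decoder does not carry hidden state between calls is to re-feed the whole prefix generated so far at every step; with a stateful RNN decoder, feeding a single token plus the carried state also works. A generic sketch of the first pattern (every name here is made up, this is not the assignment’s translate()):

import numpy as np

def simple_decoder(tokens):
    # Stand-in for a real decoder: returns random logits of shape (1, len(tokens), vocab).
    return np.random.randn(1, len(tokens), 100).astype(np.float32)

start_id, end_id, max_len = 1, 2, 20
tokens = [start_id]
for _ in range(max_len):
    logits = simple_decoder(tokens)                 # run the model on everything generated so far
    next_token = int(np.argmax(logits[:, -1, :]))   # pick the next token from the last time step
    tokens.append(next_token)
    if next_token == end_id:                        # stop once the end-of-sentence token appears
        break
print(tokens)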

Cheers
