Bidirectional vs vanilla LSTM

Can you expand on the following comment in the assignment:
Pre-attention [LSTM]: Unlike in the encoder, in which you used a Bidirectional LSTM, here you will use a vanilla LSTM. Why? That is, why were we using a Bidirectional LSTM in the encoder and a vanilla one in the decoder? What dictates that choice?

Well, think about what the two layers are actually doing. The Encoder is analyzing the input and figuring out the relationships between all of its elements. Say the input is a sentence: the Encoder takes the word embeddings of all the words and builds representations that capture the relationships between all the words in the sentence. In that case it helps to see everything “at once” instead of just serially one step at a time, because later words may have a relationship with, and an effect on, the earlier words.

But then think about what the Decoder is going to do: it takes the understanding created by the Encoder and outputs a new sequence that expresses the output of your model. Say you’re building an English-to-French translator. Now you’ve got the encoded meaning of the English sentence as input and you want to output the French sentence. That you can do in a serial fashion, one word at a time.
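
Here’s a rough sketch of that encoder/decoder split (purely illustrative Keras code; the layer sizes and names are made up, and this is not the assignment’s model):

import tensorflow as tf

vocab_size, embed_dim, units = 10_000, 256, 128  # illustrative sizes

# Encoder: a Bidirectional LSTM reads the whole source sentence forwards and backwards.
encoder_inputs = tf.keras.Input(shape=(None,), dtype="int32")
enc_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(units, return_sequences=True)
)(enc_emb)                                   # shape: (batch, src_len, 2 * units)

# Decoder: a vanilla (unidirectional) LSTM, because at generation time only the
# tokens produced so far exist; there is no "future" to read backwards from.
decoder_inputs = tf.keras.Input(shape=(None,), dtype="int32")
dec_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(decoder_inputs)
decoder_outputs = tf.keras.layers.LSTM(2 * units, return_sequences=True)(dec_emb)  # width matches encoder output

# Cross-attention: every decoder step attends over all encoder positions (dot-product attention).
attended = tf.keras.layers.Attention()([decoder_outputs, encoder_outputs])
logits = tf.keras.layers.Dense(vocab_size)(attended)

model = tf.keras.Model([encoder_inputs, decoder_inputs], logits)
model.summary()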


@paulinpaloalto Thank you. That’s exactly the intuition I was looking for. In the same vein, under what circumstances (type of task, corpus, etc.) would I want to use a bidirectional layer for both the encoder and the decoder? When would I want a vanilla LSTM on the encoder side and a bidirectional one on the decoder side?
Or a GRU and a bi-GRU?
Also, is there a good reference for that?

Hi @Lenny_Tevlin

I realized the discussion was about Week 1 and the Bahdanau paper, where the concept of attention was first introduced.
Without the causal mask (the “look-ahead” mask used in the Attention Is All You Need paper), the decoder has to be unidirectional: during training with teacher forcing the target tokens are fed in as decoder inputs, so a bidirectional decoder could simply peek at the very tokens it is supposed to predict, and training would not work without that mask (or other modifications).
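
As an illustration (not the paper’s or the assignment’s code), a causal mask is just a lower-triangular matrix applied to the attention scores so that position i cannot see positions after i:

import numpy as np

def causal_mask(seq_len):
    # Lower-triangular matrix of ones: 1 = "allowed to attend", 0 = "masked out".
    return np.tril(np.ones((seq_len, seq_len), dtype=np.float32))

scores = np.random.randn(5, 5).astype(np.float32)   # raw attention scores for a length-5 sequence
mask = causal_mask(5)
masked_scores = np.where(mask == 1, scores, -1e9)    # future positions get a huge negative score
weights = np.exp(masked_scores) / np.exp(masked_scores).sum(axis=-1, keepdims=True)  # row-wise softmax
print(weights.round(2))  # upper triangle is 0: no attention to future tokens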

I would suggest the free online book Dive into Deep Learning, in particular the chapter on Modern Recurrent Neural Networks, for more information on this topic, and also its coverage of Bahdanau attention (and other attention mechanisms) related to it.

Cheers


Thank you so much for your explanation, @arvyzukai


@arvyzukai Not a conceptual but a technical question:
Why do we need line 19 in the generate_next_token function (section 4: using the model for inference),
logits = logits[:, -1, :] ?
We are doing a squeeze in lines 31 and 32 in any case.


Hi @Lenny_Tevlin

What this achieves is taking the model’s last output in the sequence (since the shape is (batch_size, sequence_length, probabilities)). In other words, we get the log probabilities of the predicted “next token”.

The squeeze removes the batch dimension, which in this case is 1, since we input only a single sentence. That’s the idea for the squeeze, I think.
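
To make that concrete, here is a small shape walk-through with made-up sizes (this is not the assignment’s exact generate_next_token code); it also shows why the slice is still needed even though we squeeze later:

import numpy as np

batch_size, seq_len, vocab_size = 1, 7, 12           # illustrative sizes
logits = np.random.randn(batch_size, seq_len, vocab_size).astype(np.float32)

print(np.squeeze(logits).shape)        # (7, 12): squeeze alone only drops the batch dim of 1,
                                       # it still leaves one row of scores per input position
last_step = logits[:, -1, :]           # (1, 12): keep only the scores for the *next* token
next_token = int(np.argmax(np.squeeze(last_step)))   # drop the batch dim, greedy pick
print(last_step.shape, next_token)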

But thanks to your question, I believe I spotted an issue with the next exercise, the Exercise 5 translate() function, where generate_next_token is fed a single token every time (next_token is always a single token).
I will report the issue.
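
For context, one common decoding pattern when the decoder does not carry hidden state between calls is to re-feed the whole prefix generated so far at every step; with a stateful RNN decoder, feeding a single token plus the carried state also works. A generic sketch of the first pattern (every name here is made up, this is not the assignment’s translate()):

import numpy as np

def simple_decoder(tokens):
    # Stand-in for a real decoder: returns random logits of shape (1, len(tokens), vocab).
    return np.random.randn(1, len(tokens), 100).astype(np.float32)

start_id, end_id, max_len = 1, 2, 20
tokens = [start_id]
for _ in range(max_len):
    logits = simple_decoder(tokens)                 # run the model on everything generated so far
    next_token = int(np.argmax(logits[:, -1, :]))   # pick the next token from the last time step
    tokens.append(next_token)
    if next_token == end_id:                        # stop once the end-of-sentence token appears
        break
print(tokens)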

Cheers
