What's the point of an RNN encoder in seq2seq models with attention?

I don’t really understand why it makes sense to have an RNN encoder in seq2seq models with attention… I understood the purpose of the encoder when we weren’t using the attention mechanism - it condenses all the inputs into a single context vector. But now that attention needs the input words at every timestep, what is the point of the encoder RNN? Aren’t the inputs already vectorized thanks to word2vec?

My guess, assuming the inputs are already word2vec embeddings, is that the RNN transforms them into a more useful representation of each word, based on the words that came before it in the same sentence…
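To make my guess concrete, the picture in my head is something like this (a rough PyTorch sketch; the sizes and names are just placeholders I made up):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10_000, 300, 512

embedding = nn.Embedding(vocab_size, embed_dim)  # stand-in for a word2vec lookup table
encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 7))    # one sentence of 7 token ids
static_vectors = embedding(tokens)               # (1, 7, 300): context-free, like word2vec
contextual_states, _ = encoder(static_vectors)   # (1, 7, 512): each state also reflects the words before it
```

Is that roughly what the encoder is for?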

Hey @Max_Rivera,
I think we are losing sight of why the attention mechanism was introduced in the first place. For long sequences, encoding the entire sentence into a single context vector and then decoding from it loses important information. Attention was introduced to fix exactly that: for each output, it assigns an appropriate weight to each of the inputs. Note, though, that the motive behind attention is to assign those weights, not to extract feature representations from the word embeddings; that is still the job of the RNN encoder.
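To make that division of labour concrete, here is a minimal sketch (assuming PyTorch; the shapes and names are illustrative, not from the course) of dot-product attention over the encoder's per-timestep states:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: 7 source words, hidden size 512 (made-up numbers)
encoder_states = torch.randn(1, 7, 512)  # one state per input word: the RNN encoder's output
decoder_state = torch.randn(1, 512)      # current decoder hidden state

# Attention's job: score each encoder state against the decoder state...
scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (1, 7)
weights = F.softmax(scores, dim=1)                                         # (1, 7), sums to 1

# ...and form a context vector as a weighted sum of the encoder states
context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)       # (1, 512)
```

Notice that attention only re-weights states the encoder has already produced; it never builds contextual representations on its own.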

Hence the need for the RNN encoder. So, in my opinion, you are partially correct: we definitely need more useful representations than plain word embeddings. Next week, you will learn about Transformers, which exploit the attention mechanism without an RNN encoder, though they still use an encoder, just a different kind. I hope this helps.

P.S. - The RNN encoder can also be bi-directional, in which case it encodes information from the words both before and after the current word.
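As a sketch (again assuming PyTorch, with placeholder sizes), that is a one-flag change:

```python
import torch
import torch.nn as nn

bi_encoder = nn.GRU(input_size=300, hidden_size=512, batch_first=True, bidirectional=True)
embedded = torch.randn(1, 7, 300)  # 7 embedded words (placeholder values)
states, _ = bi_encoder(embedded)   # (1, 7, 1024): forward and backward states concatenated
# states[:, t, :512] summarises the sentence up to word t; states[:, t, 512:] from word t onward
```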

Regards,
Elemento

Hi @Max_Rivera,

Historically, RNNs were used first for seq2seq, but for long sequences they suffered from the initial words being forgotten.
That is why attention came into the picture: it helped mitigate this issue.

In the paper “Attention Is All You Need”, the authors built the whole system on attention alone (the Transformer). Since then, large language models have stacked Transformer blocks and are currently the state of the art.

Hope this helps,
Arka