What's the point of an RNN encoder in seq2seq models with attention?

I don’t really understand why it makes sense to have an RNN encoder in seq2seq models with attention. I understood the purpose of the encoder when we weren’t using the attention mechanism - it’s so we can condense all the inputs into a single context vector. But now with attention, the decoder looks at the input words at every timestep, so what is the point of the encoder RNN? Aren’t the inputs already vectorized because of word2vec?

And my guess, assuming that the inputs are already word2vec embeddings, is that the RNN transforms the word2vec embeddings into a more useful representation of each word based on words in that same sentence that came previously in time…

Hey @Max_Rivera,
I think we are losing sight of why the attention mechanism was introduced in the first place. For long sequences, encoding the entire sentence into a single context vector and then decoding from it lost important information. Attention was introduced to fix this: for each output step, it assigns a weight to each of the inputs. Note that the purpose of attention is to assign those weights, not to extract feature representations from the word embeddings; that is still the job of the RNN encoder.

Hence the need for the RNN encoder. So, in my opinion, you are partially correct, since we definitely need more useful representations than just word embeddings. Next week you will learn about Transformers, which exploit the attention mechanism without an RNN encoder; they still use an encoder, just a different kind. I hope this helps.

P.S. - The RNN encoder can be a bi-directional RNN too, in which case each word's representation encodes information from the words both before and after it.
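
To make the split between the two jobs concrete, here is a minimal NumPy sketch (not from the course material). Everything in it is an assumption for illustration: the toy vocabulary, the random vectors standing in for word2vec embeddings, the untrained vanilla tanh RNN, and the plain dot-product scoring. It shows that the word "bank" has one fixed embedding but gets a different encoder state at each position, and that attention only weights those states.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim = 4, 3

# Hypothetical static embeddings standing in for word2vec vectors.
vocab = {"the": 0, "bank": 1, "river": 2, "money": 3}
E = rng.normal(size=(len(vocab), emb_dim))

# Untrained, randomly initialised vanilla RNN parameters (illustrative only).
W_xh = 0.5 * rng.normal(size=(emb_dim, hid_dim))
W_hh = 0.5 * rng.normal(size=(hid_dim, hid_dim))
b_h = np.zeros(hid_dim)

def encode(tokens):
    """Run a unidirectional tanh RNN; return one hidden state per token."""
    h = np.zeros(hid_dim)
    states = []
    for tok in tokens:
        x = E[vocab[tok]]                        # static embedding lookup
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)   # mixes in the previous state
        states.append(h)
    return np.stack(states)                      # shape: (seq_len, hid_dim)

def attend(query, states):
    """Dot-product attention: softmax the scores, then average the states."""
    scores = states @ query
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    context = weights @ states                   # weighted sum of encoder states
    return weights, context

sentence = ["the", "river", "bank", "the", "money", "bank"]
states = encode(sentence)

# Both occurrences of "bank" share one embedding but have different states.
bank_gap = np.abs(states[2] - states[5]).max()

# A hypothetical decoder state used as the attention query.
query = rng.normal(size=hid_dim)
weights, context = attend(query, states)

print("max difference between the two 'bank' states:", bank_gap)
print("attention weights:", weights)
```

Attention here does nothing except choose how much of each row of `states` goes into `context`; if the rows were raw embeddings, both "bank" rows would be identical and the decoder could not tell the two occurrences apart.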


Hi @Max_Rivera,

Historically, RNNs were used first for seq2seq. But for long sequences, they suffered from the early words being forgotten.
Attention was then introduced, and it helped mitigate the issue.

In the paper “Attention Is All You Need”, the authors dropped recurrence and built the whole model (the Transformer) from attention. After that, large language models were built by stacking Transformer layers, and they are currently the state of the art.

Hope this helps,