In the lecture and slides, what I heard and read is that seq2seq models use LSTMs and GRUs to avoid vanishing/exploding gradient problems. But in quiz question #1,

Which of the following are bottlenecks when implementing seq2seq models?

It looks like there are still vanishing/exploding gradient problems?

Hi @larryleguo

Yes, there still are. LSTMs and GRUs help with vanishing/exploding gradient problems (compared to a vanilla RNN), but these problems still persist because of the model architecture:

Recurrent models typically take in a sequence in the order it is written and use that to output a sequence. Each element in the sequence is associated with a computation time step t (i.e. if a word is the third element, it will be computed at t_3). These models generate a sequence of hidden states h_t, each a function of the previous hidden state h_{t-1} and the input for position t.
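To make the recurrence concrete, here is a minimal sketch of a vanilla RNN with a scalar hidden state. The names `rnn_step`, `run_rnn`, `w_h`, and `w_x` and the weight values are assumptions chosen for illustration, not anything from the course code; a real model would use vectors and learned weight matrices.

```python
import math

def rnn_step(h_prev, x_t, w_h, w_x):
    """One vanilla-RNN step with a scalar state:
    h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    return math.tanh(w_h * h_prev + w_x * x_t)

def run_rnn(inputs, w_h=0.5, w_x=1.0, h0=0.0):
    """Process the sequence strictly in order and return h_1..h_T."""
    h = h0
    states = []
    for x_t in inputs:          # step t can only start after step t-1 finishes
        h = rnn_step(h, x_t, w_h, w_x)
        states.append(h)
    return states

# Hypothetical three-element input sequence.
states = run_rnn([1.0, -0.5, 0.25])
final_state = states[-1]        # the only summary available after the last step
```

Note that each `h_t` depends on the whole history only through `h_{t-1}`, which is why the steps cannot be computed in parallel.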

In other words, you still need to cram all the information into the hidden state of the last step t (you cannot look up “three words back” whenever you want, because all you have is the last state (or two states in the LSTM case) and your input, and from this you have to make a prediction).
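To tie this back to the gradient question, here is a rough sketch of how the gradient of the last hidden state with respect to the initial one behaves in a scalar tanh RNN. It assumes the same toy recurrence h_t = tanh(w_h * h_{t-1} + w_x * x_t); the function name `gradient_through_time` and all values are hypothetical. By the chain rule the gradient is a product of one factor per time step, so it can shrink quickly as the sequence gets longer.

```python
import math

def gradient_through_time(inputs, w_h, w_x=1.0):
    """Return d h_T / d h_0 for a scalar tanh RNN via the chain rule."""
    h = 0.0
    grad = 1.0
    for x_t in inputs:
        h = math.tanh(w_h * h + w_x * x_t)
        # d h_t / d h_{t-1} = w_h * (1 - h_t^2); one factor per time step
        grad *= w_h * (1.0 - h * h)
    return grad

# Same toy weight, two hypothetical sequence lengths.
short_grad = gradient_through_time([0.1] * 5, w_h=0.5)
long_grad = gradient_through_time([0.1] * 50, w_h=0.5)
print(short_grad, long_grad)
```

With `w_h=0.5` every factor is at most 0.5, so the gradient over 50 steps is many orders of magnitude smaller than over 5 steps. The gating in LSTMs and GRUs makes these factors easier to keep close to 1, which helps, but the gradient still has to travel through every step.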


@arvyzukai this is exactly what I thought, “**channeling the information**” in GRUs and LSTMs would improve model performance but not necessarily eliminate the vanishing/exploding gradient issue.

Thank you for this much needed clarification!

~ Ani