Limitation of seq2seq without attention

Why exactly do RNN seq2seq models that don’t use attention have trouble with longer sequences? Would it be accurate to say there is a loss of information when trying to encode a very long input sequence into a fixed-length context vector?

Hey @Max_Rivera,
In my view, there are two sources of loss. The first, as you just mentioned, arises when we encode the input as an n-length vector and the length of the input is greater than n. However, you will find that we can avoid this to a great extent by plotting the distribution of input lengths and then choosing n so that most of the inputs can be encoded without any loss, for instance, n greater than the lengths of 95% of the inputs.
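
As a small illustration of this (the corpus below is just a hypothetical list of tokenized inputs, and the 95th percentile is an arbitrary choice), one could pick n from the empirical length distribution like so:

```python
# A minimal sketch: choose the maximum encoded length n from the empirical
# length distribution of a (hypothetical) tokenized corpus.
import numpy as np

corpus = [[1, 5, 2], [4, 8, 1, 9, 3], [7, 2], [3, 3, 1, 2, 2, 9, 4]]  # toy token-id sequences

lengths = np.array([len(seq) for seq in corpus])
n = int(np.percentile(lengths, 95))       # n covers roughly 95% of the inputs
print(f"chosen n = {n}, covers {np.mean(lengths <= n):.0%} of the corpus")
```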

The second reason, which is of major importance here, is vanishing gradients. Prof. Andrew discussed this in a lecture video in the first week, entitled “Vanishing Gradients with RNNs”. With the help of more advanced networks like GRUs and LSTMs, we can propagate information across much longer spans, thereby reducing the effect of vanishing gradients.
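
If you are curious, here is a rough PyTorch sketch of my own (untrained weights and toy shapes, so the exact magnitudes are only illustrative) that measures how much gradient from a loss at the last time step flows back to the very first input, for a vanilla RNN versus an LSTM:

```python
# Rough sketch: compare the gradient reaching the first input token when the
# loss depends only on the last time step of a long sequence.
import torch
import torch.nn as nn

T, B, D, H = 100, 1, 8, 16                # sequence length, batch, input dim, hidden dim
x = torch.randn(T, B, D, requires_grad=True)

for name, cell in [("RNN", nn.RNN(D, H)), ("LSTM", nn.LSTM(D, H))]:
    out, _ = cell(x)                      # out: (T, B, H)
    loss = out[-1].sum()                  # loss depends only on the last time step
    loss.backward()
    first_step_grad = x.grad[0].norm().item()   # gradient at the very first input
    print(f"{name}: ||dL/dx_0|| = {first_step_grad:.2e}")
    x.grad = None                         # reset before the next model
```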

A vague analogy with CNNs comes to my mind. In the 4th course, you must have seen that we faced the same issue of vanishing gradients with deep conv-nets, and we therefore introduced ResNets, i.e., CNNs with residual connections, to reduce the effect of vanishing gradients. If you think about the structure of either a GRU or an LSTM, you will also find additional connections between subsequent cells in a single layer, governed by additional learnable parameters; these can be thought of as residual connections, with the sole difference being the additional parameters, as sketched below. This is just one of the things that comes to my mind. It may be wrong or confusing, so I would advise you not to rely on it too strongly. I hope this helps.
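
To make the analogy slightly more concrete, here is the GRU memory-cell update in (roughly) the notation of the course lectures; the update gate Γ_u decides how much of the previous cell state is carried forward unchanged, which is the additive, residual-like path I had in mind:

```latex
% GRU memory-cell update: \Gamma_u is the update gate, \tilde{c}^{<t>} the
% candidate value, and c^{<t-1>} the previous cell state carried forward.
c^{<t>} = \Gamma_u \odot \tilde{c}^{<t>} + (1 - \Gamma_u) \odot c^{<t-1>}
```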

Regards,
Elemento

I think Elemento pointed out an important aspect: vanishing gradients.

Going back to your original question, I think the answer can be found in several papers that tackled the performance degradation on longer sentences.

Here is one example: *Neural Machine Translation by Jointly Learning to Align and Translate* (Bahdanau et al.):

> A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Cho et al. (2014b) showed that indeed the performance of a basic encoder–decoder deteriorates rapidly as the length of an input sentence increases.

As you pointed out, the output from an encoder is a fixed-length context vector taken from the last step of the encoder. It is quite difficult to keep old information in there as a sentence gets longer. The paper above is one example. Most approaches focus on “how to retrieve old information from the encoder network in order to feed it into the decoder network”; in this sense, such models are sometimes called “peeky” seq2seq models, as in the sketch below.
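
To illustrate the “peeky” idea, here is a minimal PyTorch sketch (the `PeekyDecoder` module and all shapes are my own illustration, not from any specific paper): the encoder’s fixed context vector is concatenated to the decoder input at every time step, instead of being handed over only once as the initial hidden state:

```python
# Minimal sketch of a "peeky" decoder: the same encoder context vector is
# concatenated to the decoder input at every time step.
import torch
import torch.nn as nn

class PeekyDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # the decoder GRU sees [embedding ; context] at each step
        self.gru = nn.GRU(emb_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, context):
        # tokens:  (batch, T_dec)      target-side token ids (teacher forcing)
        # context: (batch, hidden_dim) encoder's final hidden state
        emb = self.embed(tokens)                          # (batch, T_dec, emb_dim)
        ctx = context.unsqueeze(1).expand(-1, emb.size(1), -1)
        h, _ = self.gru(torch.cat([emb, ctx], dim=-1))    # "peek" at context every step
        return self.out(h)                                # (batch, T_dec, vocab)

# usage with toy shapes
dec = PeekyDecoder(vocab_size=100, emb_dim=32, hidden_dim=64)
logits = dec(torch.randint(0, 100, (2, 7)), torch.randn(2, 64))
print(logits.shape)  # torch.Size([2, 7, 100])
```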

So, at least in the academic community, your statement is correct. But most work has already moved on to attention-related approaches or beyond. :slight_smile: