From what I understood from the course, RNN-like models (RNNs, LSTMs, GRUs) store information in hidden states passed between timesteps. The information in the hidden states decays over timesteps, but that decay is gradual.
However, it seems that transformer decoders have a fixed input sequence length, and the rest of the input must be padded. Does this not mean that after a certain number of tokens are generated in the output, the input window shifts to the right and previous words are no longer fed to the model? Unlike RNNs, is this not a hard cutoff on the information available to the model, since it can no longer even see previous inputs, much less retain information about them?
And if so, how do these models remember their own outputs from many time steps ago? Is it through a large context window?
From my point of view, the contrast between recurrent neural networks (RNNs) and transformer decoders becomes apparent in how they manage information. In RNNs, information is retained in hidden states passed across timesteps and decays gradually over time. In transformers, the decoder works on fixed-length input sequences, which can mean losing earlier input tokens as the context window shifts during generation.
In transformers, the concern about losing access to prior tokens is mitigated by the self-attention mechanism. Self-attention lets each position in the input sequence attend to every other position when computing its output representation, so the model can draw on information from earlier tokens as long as they remain inside the context window.
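For concreteness, here is a minimal NumPy sketch of single-head self-attention (the learned query/key/value projections and multiple heads of a real transformer are omitted, so Q = K = V = X); the point is just that every output row is a weighted mixture of every position currently in the window:

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention over token vectors X of shape (seq_len, d).
    Learned query/key/value projections are omitted for brevity, so Q = K = V = X."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ X                               # each output row mixes all positions

X = np.random.randn(6, 8)   # 6 tokens, 8-dimensional embeddings
out = self_attention(X)     # out[i] depends on every token in the window, not just neighbours of i
```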
To keep track of past outputs, especially in autoregressive generation, transformers also use “position embeddings.” These embeddings are added to the input embeddings to convey each token’s position in the sequence. With position information available, the model can learn to generate outputs that depend on its own previous outputs, which effectively preserves information from earlier time steps.
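As a rough illustration, here is the sinusoidal position-embedding scheme from the original “Attention Is All You Need” paper, simply added to the token embeddings. Many GPT-style models use learned position embeddings instead, but the idea of injecting position information into the input is the same:

```python
import numpy as np

def sinusoidal_position_embeddings(seq_len, d_model):
    """Sinusoidal position embeddings as in the original Transformer paper."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimensions: (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims get sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims get cosine
    return pe

token_embeddings = np.random.randn(6, 8)                       # 6 tokens, d_model = 8
x = token_embeddings + sinusoidal_position_embeddings(6, 8)    # position info is simply added
```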
In summary, transformers compensate for the limited context window through self-attention and encode token order with position embeddings, which lets them model long-range dependencies and retain memory of past outputs within that window.
Hi @elirod, thanks for the explanation, although I still have a few questions.
You said that the self-attention mechanism allows the model to consider all input elements, but what if we consider a decoder-only architecture, like the one used for the GPT models? Since there is no encoder and the model’s inputs are its own previous outputs, as we generate more outputs the context window will shift right, and the model will essentially forget what it had generated in the past. Is this not a problem?
Also, regarding position encoding, it remains unclear to me how simply adding a value to the embedding vectors allows the model to retain information about previous outputs. Won’t that only allow the model to indirectly infer a token’s position in the sequence, while still revealing nothing about previous outputs?
Well, when considering decoder-only architectures, a concern does arise regarding the shifting context window: as more outputs are generated, the model’s ability to recall earlier generated tokens weakens once they fall out of the window. However, a few things help alleviate this:
GPT models stack multiple layers of self-attention, which lets them retain and propagate information about a substantial portion of the input sequence even as the context window moves.
Position embeddings and self-attention together address the problem of the model losing track of its previous outputs: position embeddings let the model differentiate between positions, while self-attention lets it relate the token it is currently generating to the prior ones still in the window.
Regarding position embeddings, your concern is valid. Position embeddings by themselves do not store any information about previous outputs; their purpose is only to differentiate tokens by position. The key is that the self-attention mechanism, combined with position embeddings, lets the model incorporate information from any part of the input sequence, including its own prior outputs that are still in the window. This interaction is what captures dependencies between tokens at different positions and effectively retains information from previous outputs.
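To tie the two mechanisms together, here is a minimal sketch (assuming the simple hard-truncation behaviour we have been discussing) of causal self-attention over whatever tokens are still inside the window. Position i can attend to every earlier position that is still present, which is how the model conditions on its own previous outputs, while anything that has fallen out of the window really is gone:

```python
import numpy as np

def causal_self_attention(X):
    """Single-head self-attention with a causal mask: position i can only attend to
    positions <= i, i.e. the prompt plus the model's own earlier outputs."""
    seq_len, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[future] = -np.inf                        # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

context_window = 4
generated = np.random.randn(10, 8)       # stand-in for embeddings of 10 generated tokens
visible = generated[-context_window:]    # the hard cutoff: older tokens are simply dropped
out = causal_self_attention(visible)     # attention reaches everything still in the window
```

The last three lines are the crux of your original question: self-attention can reach anything inside `visible`, but nothing before it.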