In the final videos of this week, Younes claims that the Reformer, using LSH attention and reversible layers, can ingest context lengths of 1M+ tokens and still perform comparably to a traditional transformer. If that is the case, why do we continue to see state-of-the-art models (Llama-2, GPT-3, PaLM 2, Falcon-180B, etc.) that use standard transformers and are limited to much shorter context lengths (4k and 16k are common)?
Why don’t state-of-the-art models use the reversible transformer architecture?
Hi @Mrplants
That is a good question, and the answer depends on the case. Not all applications require such long context lengths, and using a simpler architecture with lower resource demands can be more practical and cost-effective. Intuitively, people’s brains probably also do not use a 1M+ token context, but instead compress and store information for later use. In practice, however, the extrinsic performance (accuracy on your particular task, or another metric) is what matters most, and if your application does not require 1M+ tokens to achieve the task, then such a long context might be a burden (overfitting due to spurious correlations, lack of signal in distant tokens).
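To make the resource argument concrete, here is a rough back-of-envelope sketch (the head count and chunk size are illustrative assumptions, not figures from the course) of why full self-attention over a 1M-token context is impractical, while LSH-style chunked attention keeps memory manageable:

```python
# Back-of-envelope memory comparison: full self-attention vs. LSH-style
# chunked attention at 1M tokens. Assumed values (fp16 scores, 8 heads,
# chunk size 64) are illustrative only.
SEQ_LEN = 1_000_000
BYTES = 2      # fp16
HEADS = 8
CHUNK = 64     # tokens per LSH bucket/chunk

# Full self-attention materializes an (n x n) score matrix per head.
full_attention_bytes = HEADS * SEQ_LEN * SEQ_LEN * BYTES

# LSH attention only scores tokens within (and adjacent to) their chunk,
# so the score matrix is roughly (n x 2*chunk) per head.
lsh_attention_bytes = HEADS * SEQ_LEN * 2 * CHUNK * BYTES

print(f"full attention : {full_attention_bytes / 1e12:,.0f} TB")  # ~16 TB
print(f"LSH attention  : {lsh_attention_bytes / 1e9:,.2f} GB")    # ~2 GB
```

So the Reformer tricks are what make 1M+ tokens feasible at all; whether you need them comes down to whether your task actually benefits from that much context.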
Cheers