In the final videos of this week, Younes claims that the Reformer, using LSH attention and reversible layers, can ingest context lengths of 1M+ tokens and still perform comparably to a traditional transformer. If that is the case, why do we continue to see state-of-the-art models (Llama-2, GPT-3, PaLM 2, Falcon-180B, etc.) that use standard transformers and are limited to much shorter context lengths (4k and 16k are common)?
Why don’t state-of-the-art models use the reversible transformer architecture?
Hi @Mrplants
That is a good question, and the answer depends on the case. Not all applications require such long context lengths, and using a simpler architecture with lower resource demands can be more practical and cost-effective. Intuitively, people’s brains probably also do not use a 1M+ token context, but instead compress and store information for later use. In practice, however, the extrinsic performance (accuracy on your particular task, or another metric) is what matters most, and if your application does not require 1M+ tokens to achieve the task, then such a long context might be a burden (overfitting due to spurious correlations, lack of signal in distant tokens).
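To make the resource argument concrete, here is a rough back-of-envelope sketch (the head count and chunk size are illustrative assumptions, not figures from the course) of why full self-attention over a 1M-token context is impractical, while LSH-style chunked attention keeps memory manageable:

```python
# Back-of-envelope memory comparison: full self-attention vs. LSH-style
# chunked attention at 1M tokens. Assumed values (fp16 scores, 8 heads,
# chunk size 64) are illustrative only.
SEQ_LEN = 1_000_000
BYTES = 2      # fp16
HEADS = 8
CHUNK = 64     # tokens per LSH bucket/chunk

# Full self-attention materializes an (n x n) score matrix per head.
full_attention_bytes = HEADS * SEQ_LEN * SEQ_LEN * BYTES

# LSH attention only scores tokens within (and adjacent to) their chunk,
# so the score matrix is roughly (n x 2*chunk) per head.
lsh_attention_bytes = HEADS * SEQ_LEN * 2 * CHUNK * BYTES

print(f"full attention : {full_attention_bytes / 1e12:,.0f} TB")  # ~16 TB
print(f"LSH attention  : {lsh_attention_bytes / 1e9:,.2f} GB")    # ~2 GB
```

So the Reformer tricks are what make 1M+ tokens feasible at all; whether you need them comes down to whether your task actually benefits from that much context.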
Cheers