Not understanding how Reversible Layers in the Reformer saves memory

arvyzukai · September 4, 2023, 8:13am

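That is a good question, and the answer lies in the FeedForward layer: its dimensionality is several times higher than d_model (d_model = 1024, d_ff = 4096), so its activations are the expensive ones to keep around. With reversible layers, the activations do not have to be stored for the backward pass, because each layer's inputs can be recomputed from its outputs. We save memory at the expense of compute (about 1.5-2× more expensive on compute, depending on the implementation).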
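To make the "recompute instead of store" idea concrete, here is a minimal sketch in plain Python. It assumes the usual reversible residual formulation (y1 = x1 + Attention(x2), y2 = x2 + FeedForward(y1)). The two `*_stub` functions are hypothetical stand-ins I made up for the real sublayers, not anything from Trax or the course code.

```python
import math
import random


def attention_stub(x):
    # Hypothetical stand-in for the attention sublayer; any deterministic function works.
    return [math.tanh(v) for v in x]


def feed_forward_stub(x):
    # Hypothetical stand-in for the feed-forward sublayer (a scaled ReLU).
    return [0.5 * max(0.0, v) for v in x]


def reversible_forward(x1, x2):
    # y1 = x1 + Attention(x2), y2 = x2 + FeedForward(y1)
    y1 = [a + f for a, f in zip(x1, attention_stub(x2))]
    y2 = [b + g for b, g in zip(x2, feed_forward_stub(y1))]
    return y1, y2


def reversible_inverse(y1, y2):
    # Undo the two residual additions in reverse order.
    x2 = [b - g for b, g in zip(y2, feed_forward_stub(y1))]
    x1 = [a - f for a, f in zip(y1, attention_stub(x2))]
    return x1, x2


random.seed(0)
x1 = [random.uniform(-1, 1) for _ in range(8)]
x2 = [random.uniform(-1, 1) for _ in range(8)]

# Forward through a stack of layers, keeping only the latest pair of activations.
num_layers = 6
h1, h2 = x1, x2
for _ in range(num_layers):
    h1, h2 = reversible_forward(h1, h2)

# Walk back from the final outputs to the original inputs, layer by layer.
r1, r2 = h1, h2
for _ in range(num_layers):
    r1, r2 = reversible_inverse(r1, r2)

# The inputs are recovered up to floating-point rounding.
max_error = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print(max_error)
```

The forward loop never keeps a per-layer list of activations, which is where the memory saving comes from. The price is that the backward pass has to call the sublayers again to rebuild the inputs, which is the extra compute mentioned above.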

You should read the Reformer paper - it is very approachable and explains the ideas (both the parameter reuse and the chunking for memory savings) very well.
Also, you could check out The Reversible Residual Network paper for more details.

Cheers
