Not understanding how Reversible Layers in the Reformer save memory

Hi @Joel_Wigton

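That is a good question, and the answer lies in the FeedForward layer: its hidden dimensionality is several times larger than d_model (for example, d_model = 1024 and d_ff = 4096), so its intermediate activations are expensive to keep around for the backward pass. With reversible layers we trade compute for memory: the activations are recomputed from the layer outputs during the backward pass instead of being stored (about 1.5-2× more expensive on compute, depending on the implementation).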
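To make the idea concrete, here is a minimal sketch in plain Python (not the actual Reformer or Trax code). It assumes the usual reversible residual formulation, y1 = x1 + F(x2) and y2 = x2 + G(y1), where in the Reformer F is the attention sub-layer and G is the feed-forward sub-layer. The helper names (`make_ff`, `rev_forward`, `rev_inverse`) are made up for this illustration, both F and G are stand-in feed-forward blocks, and the sizes are scaled down from 1024/4096 to 8/32 to keep it fast.

```python
import random

def make_ff(d_model, d_ff, seed):
    """Hypothetical tiny feed-forward block: d_model -> d_ff (ReLU) -> d_model."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(d_ff)]

    def ff(x):
        # The d_ff-wide hidden vector is the large intermediate activation.
        hidden = [max(0.0, sum(x[i] * w1[i][j] for i in range(d_model)))
                  for j in range(d_ff)]
        return [sum(hidden[j] * w2[j][k] for j in range(d_ff))
                for k in range(d_model)]

    return ff

def rev_forward(x1, x2, f, g):
    # y1 = x1 + F(x2), y2 = x2 + G(y1)
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def rev_inverse(y1, y2, f, g):
    # Recover the inputs from the outputs alone by re-running G and F:
    # x2 = y2 - G(y1), x1 = y1 - F(x2)
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

d_model, d_ff = 8, 32               # same 1:4 ratio as d_model = 1024, d_ff = 4096
f = make_ff(d_model, d_ff, seed=0)  # stand-in for the attention sub-layer
g = make_ff(d_model, d_ff, seed=1)  # stand-in for the feed-forward sub-layer

rng = random.Random(42)
x1 = [rng.uniform(-1, 1) for _ in range(d_model)]
x2 = [rng.uniform(-1, 1) for _ in range(d_model)]

y1, y2 = rev_forward(x1, x2, f, g)
r1, r2 = rev_inverse(y1, y2, f, g)

# The inputs are reconstructed up to floating-point rounding.
max_err = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print("inputs recovered:", max_err < 1e-9)
```

Because the inputs of each block can be rebuilt from its outputs, the backward pass does not need the stored activations of every layer, only those of the last one. The price is the extra calls to F and G in `rev_inverse`, which is where the additional compute comes from.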

You should read the Reformer paper; it is very approachable and explains the ideas (both the parameter reuse and the chunking for memory savings) very well.
You could also check out the Reversible Residual Network paper for more details.

Cheers