Shouldn't it be y_2 = y_1 + FeedFwd(y_1)?

Hi @larryleguo

What you are describing is the standard Transformer. In the standard Transformer you feed y_1 into the FeedForward, but in the Reformer you use reversible residual connections - the FeedForward residual is added to a copy of x (that is, x_2).

So for the standard Transformer, the FeedForward output is calculated as:

y_b = y_a + G(y_a)

and for the Reformer, the FeedForward output is calculated as:

y_2 = x_2 + G(y_1)

Then they are not equivalent? What is the relation between x_2 and y_1?

You are correct - they are not equivalent. Maybe this picture will clarify things:

As you can see, y_1 = x_1 + F(x_2), where x_1 has the same values as x_2 and F is the Attention function. In other words, y_1 is x + F(x), just like in the standard Transformer (where y_a = x + F(x)), so y_a has the same values as y_1.

y_2, on the other hand, has *totally different* values than y_b, y_a, and y_1 (the latter two being the same). As you can see, y_2 = x_2 + G(y_1): here G is the FeedForward function applied to y_1, and its output is added to x_2 (which is the same as x_1).
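To make the reversibility concrete, here is a minimal NumPy sketch of one reversible block. The `attention_f` and `feedforward_g` functions are toy stand-ins I made up for illustration, not the real attention/feedforward; the point is that x_1 and x_2 can be reconstructed from y_1 and y_2 without ever storing them:

```python
import numpy as np

def attention_f(x):
    # toy stand-in for the Attention function F (hypothetical)
    return np.tanh(x)

def feedforward_g(x):
    # toy stand-in for the FeedForward function G (hypothetical)
    return np.maximum(x, 0.0)

def reversible_forward(x1, x2):
    # forward pass of one reversible residual block
    y1 = x1 + attention_f(x2)
    y2 = x2 + feedforward_g(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    # recover the inputs from the outputs by running the residuals backwards
    x2 = y2 - feedforward_g(y1)
    x1 = y1 - attention_f(x2)
    return x1, x2

x = np.random.randn(4, 8)
y1, y2 = reversible_forward(x, x.copy())   # x_1 and x_2 start as copies of x
x1_rec, x2_rec = reversible_inverse(y1, y2)
assert np.allclose(x1_rec, x) and np.allclose(x2_rec, x)
```

Because the inverse exists in closed form, the backward pass can recompute each layer's activations on the fly instead of caching them.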

Hey @arvyzukai,

In the Reading entitled “Reversible Residual Layers”, the below statement is mentioned:

Now you don't have to store the weights, because you can just compute them from scratch.

If I am not wrong, it should be “activations” instead of “weights”, right?

Cheers,

Elemento

Hey @Elemento

I believe you're right - it should be "intermediate activations" or "intermediate inputs". I'm not friends with terminology.

Cheers

Thanks a lot @arvyzukai Let me create an issue for this, so that we can try to have a better friendship with the terminology.

Cheers

Hi, @arvyzukai

Are the standard residual and reversible basically 2 different model architectures, i.e., the model weights will be completely different even if inputs and random seeds are exactly the same?

How does the reversible residual net help with saving memory? We said we don't want to store the activations, but by creating a new copy we are basically storing the activations.

Hi @Peixi_Zhu

The architectures are different, but in theory the model weights would not be different (if not for some practical details): reconstructing the activations over many layers causes numerical errors to accumulate, though in practice they are small in magnitude.

In short, we win by not saving the FeedForward activations, whose width d_ff is several times larger than d_model (d_model = 1024, d_ff = 4096). Keeping two copies of width 1024 costs less than storing the 4096-wide activations for every layer. The flip side is that we have to recompute them during the backward pass - the classic memory vs. computation trade-off.
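For intuition on how much this saves, here is a rough back-of-envelope sketch per token. The d_model and d_ff values are the ones above; the layer count of 12 is my own assumption for illustration:

```python
# Rough, back-of-envelope activation-memory comparison (per token).
d_model, d_ff = 1024, 4096
n_layers = 12  # hypothetical depth, for illustration only

# Standard Transformer: cache each layer's d_ff-wide FeedForward
# activations for the backward pass.
standard = d_ff * n_layers

# Reversible layers: keep only the two d_model-wide copies of the
# final layer's output; earlier activations are recomputed.
reversible = d_model * 2

print(standard / reversible)  # savings factor grows with depth
```

Note that the standard side scales with the number of layers while the reversible side does not, which is why the saving becomes dramatic for deep models and long sequences.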

I would advise reading the Reformer paper - it's very approachable and explains the details well.

Cheers