Shouldn't it be y_2 = y_1 + FeedFwd(y_1)?

Hi @larryleguo

What you are describing is the standard Transformer. In the standard Transformer you feed y_1 into the FeedForward, but in the Reformer you use reversible residual connections - the FeedForward residual is added to a copy of x (that is, x_2).

So for the standard Transformer, the FeedForward output is calculated as:

y_b = y_a + G(y_a)

and for the Reformer, the FeedForward output is calculated as:

y_2 = x_2 + G(y_1)

Then they are not equivalent? What is the relation between x_2 and y_1?

You are correct - they are not equivalent. Maybe this picture will clarify things:

As you can see, y_1 = x_1 + F(x_2), where x_1 has the same values as x_2 and F is the Attention function. In other words, y_1 is x + F(x), just like in the standard Transformer (where y_a = x + F(x)), so y_a has the same values as y_1.

y_2, on the other hand, has *totally different* values than y_b, y_a, and y_1 (the latter two being the same). As you can see, y_2 = x_2 + G(y_1): here G is the FeedForward function applied to y_1, and its output is added to x_2 (which is the same as x_1).
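To make the reversibility concrete, here is a minimal NumPy sketch of one reversible block. The `attention_f` and `feedforward_g` functions are toy stand-ins I made up for illustration, not the real attention/feedforward; the point is that x_1 and x_2 can be reconstructed from y_1 and y_2 without ever storing them:

```python
import numpy as np

def attention_f(x):
    # toy stand-in for the Attention function F (hypothetical)
    return np.tanh(x)

def feedforward_g(x):
    # toy stand-in for the FeedForward function G (hypothetical)
    return np.maximum(x, 0.0)

def reversible_forward(x1, x2):
    # forward pass of one reversible residual block
    y1 = x1 + attention_f(x2)
    y2 = x2 + feedforward_g(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    # recover the inputs from the outputs by running the residuals backwards
    x2 = y2 - feedforward_g(y1)
    x1 = y1 - attention_f(x2)
    return x1, x2

x = np.random.randn(4, 8)
y1, y2 = reversible_forward(x, x.copy())   # x_1 and x_2 start as copies of x
x1_rec, x2_rec = reversible_inverse(y1, y2)
assert np.allclose(x1_rec, x) and np.allclose(x2_rec, x)
```

Because the inverse exists in closed form, the backward pass can recompute each layer's activations on the fly instead of caching them.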

Hey @arvyzukai,

In the Reading entitled “Reversible Residual Layers”, the below statement is mentioned:

Now you don't have to store the weights, because you can just compute them from scratch.

If I am not wrong, it should be “activations” instead of “weights”, right?

Cheers,

Elemento

Hey @Elemento

I believe you're right - it should be "intermediate activations" or "intermediate inputs". I'm not friends with terminology.

Cheers

Thanks a lot @arvyzukai Let me create an issue for this, so that we can try to have a better friendship with the terminology.

Cheers

Hi, @arvyzukai

Are the standard residual and reversible basically 2 different model architectures, i.e., the model weights will be completely different even if inputs and random seeds are exactly the same?

How does the reversible residual net help with saving memory? We said we don't want to store the activations, but by creating a new copy we are basically storing the activations.

Hi @Peixi_Zhu

The architectures are different, but in theory the model weights would not be different (if not for some practical details): reconstructing the activations over many layers causes numerical errors to accumulate, though in practice they are small in magnitude.

In short, we win by not saving the FeedForward activations, whose width d_ff is several times larger than d_model (d_model = 1024, d_ff = 4096). Keeping two copies of width 1024 costs less than storing the 4096-wide activations for every layer. The flip side is that we have to recompute them during the backward pass - the classic memory vs. computation trade-off.
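For intuition on how much this saves, here is a rough back-of-envelope sketch per token. The d_model and d_ff values are the ones above; the layer count of 12 is my own assumption for illustration:

```python
# Rough, back-of-envelope activation-memory comparison (per token).
d_model, d_ff = 1024, 4096
n_layers = 12  # hypothetical depth, for illustration only

# Standard Transformer: cache each layer's d_ff-wide FeedForward
# activations for the backward pass.
standard = d_ff * n_layers

# Reversible layers: keep only the two d_model-wide copies of the
# final layer's output; earlier activations are recomputed.
reversible = d_model * 2

print(standard / reversible)  # savings factor grows with depth
```

Note that the standard side scales with the number of layers while the reversible side does not, which is why the saving becomes dramatic for deep models and long sequences.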

I would advise reading the Reformer paper - it's very approachable and explains the details well.

Cheers