Layer order in Residual block of UNQ_C6

A bit confused about the layer ordering in UNQ_C6.

In the Residual block, according to the image, we first apply the attention layer, then normalize it and add it to the initial values with the Residual block. The same goes for the feed-forward block: first the layers, then normalization. Does normalization happen before or after the summation done by the Residual block?
[Image: Residual block diagram from the course, showing the attention / feed-forward layer followed by normalization and the residual addition]

In the assignment, however, we do the opposite: first normalize, then apply the rest of the layers. Is one of them wrong, or am I missing something?

    return [
        tl.Residual(
            # Normalize the layer input
            tl.LayerNorm(),
            # Causal attention block defined previously (without parentheses)
            causal_attention,
            # Dropout with the specified rate and mode
            tl.Dropout(rate=dropout, mode=mode),
        ),
        tl.Residual(
            # Feed-forward block defined previously (without parentheses)
            feed_forward,
        ),
    ]
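
To spell out what I mean, here is a rough sketch of the two orderings. The `layer_norm` and `sublayer` functions below are just stand-ins I made up for illustration, not the actual Trax layers:

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # Stand-in for tl.LayerNorm (no learned scale/bias, illustration only)
        return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

    def sublayer(x):
        # Stand-in for the causal attention (or feed-forward) sublayer
        return 0.5 * x

    x = np.random.randn(2, 4)

    # Ordering I read from the image (post-norm): sublayer, add the residual, then normalize
    post_norm = layer_norm(x + sublayer(x))

    # Ordering in the assignment code (pre-norm): normalize, sublayer, then add the residual
    pre_norm = x + sublayer(layer_norm(x))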


Hi Mazatov,

Good question. I am not sure why it is done this way in the assignment. I will pass this on to the people working on the backend.

Thanks!

I have the same question. Is there any update on this post? Thanks.

Andrej Karpathy mentions in his video about GPT that nowadays it’s more common to apply the normalization layer before the attention layer. This link points directly to the relevant moment of the video: Let's build GPT: from scratch, in code, spelled out. - YouTube
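
For comparison, here is a minimal sketch (my own, not taken from the assignment) of how the figure's post-norm ordering could be written with the same Trax combinators; the block quoted in the question above is the pre-norm variant that the course uses. The hyperparameter values are assumed only for this sketch:

    import trax.layers as tl

    # Illustrative hyperparameters, assumed for this sketch only
    d_model, n_heads, dropout, mode = 512, 8, 0.1, 'train'

    # Post-norm ordering from the figure:
    #   y = LayerNorm(x + Dropout(CausalAttention(x)))
    post_norm_attention_block = tl.Serial(
        tl.Residual(
            tl.CausalAttention(d_model, n_heads=n_heads, mode=mode),
            tl.Dropout(rate=dropout, mode=mode),
        ),
        # Normalization happens only after the residual addition
        tl.LayerNorm(),
    )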
