Transformer Encoder Block tl.Mean

In the TransformerEncoder function of the assignment, why do we add tl.Mean(axis=1) after the encoder_blocks? I think axis=1 here refers to the n_sequence dimension. Why do we do this?

Yes, you are correct that axis=1 refers to the sequence dimension.

We do this for the same reason as with RNNs: so that the encoder passes a meaningful representation of the sequence to the decoder. This helps the decoder focus on the appropriate words in the input during decoding (the decoder takes its own input, along with this encoder representation of the sequence, to generate its own output).

Theoretically you could do whatever you want with the encoder output, but this output must match the decoder’s expected input from the encoder.


Can you explain this in a bit more detail @arvyzukai ?

Hi @Aaditya1

To be fair, I don’t quite remember the context of this question (what kind of decoder I was talking about). I glanced at the overview of this (C4 W3) assignment and it states:

This assignment will be different from the two previous ones. Due to memory and time constraints of this environment you will not be able to train a model and use it for inference. Instead you will create the necessary building blocks for the transformer encoder model and will use a pretrained version of the same model in two ungraded labs after this assignment.

I glanced at the C4 W4 ungraded labs and could not find any sign of this encoder (maybe the course changed?). Perhaps I need to spend more time looking into these labs, but unfortunately I cannot do that right now.

In any case, looking at the code we can see an example of inputs:

<Z>il plaid lycra and span <Y>ex shortall with metallic slink <X>
inset <W>. Attached metallic elastic belt with O <V>ring. Headband
included. <U> hip <T> jazz dance costume.<S> in the USA.

and targets:
<Z> Fo <Y>d <X>y <W>s <V>- <U> Great <T> hop or<S> Made

Main points:

  1. A batch is normally composed of many of these input/target pairs, and the batch is usually dimension 0.
  2. The input in this example is roughly ~40 tokens long - the sequence length (the OP’s question).
  3. The final output of this model is a toy output of 10 numbers (n_classes). I am not sure why 10 was chosen - I think it was just an arbitrary number for learning purposes, and these 10 classes do not represent anything.

So, all in all, the process of getting a prediction goes something like this:

  • input of shape (batch_size, seq_len_padded), for example (64, 300);
  • after embedding (batch_size, seq_len_padded, embedding_dim), for example (64, 300, 512);
  • after encoder blocks (batch_size, seq_len_padded, d_model), for example (64, 300, 512);
  • after tl.Mean(axis=1) (batch_size, d_model), for example (64, 512);
  • after toy Dense (batch_size, n_classes), for example (64, 10).
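The shape flow above can be sketched with plain numpy (a toy stand-in for trax: random arrays play the role of the encoder output, and a random matrix plays the role of the final Dense layer; the shapes are the point, not the values):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, d_model, n_classes = 64, 300, 512, 10

# Pretend this is the output of the encoder blocks:
# shape (batch_size, seq_len_padded, d_model)
encoded = rng.normal(size=(batch_size, seq_len, d_model))

# tl.Mean(axis=1) is just an average over the sequence dimension:
pooled = encoded.mean(axis=1)
print(pooled.shape)  # (64, 512)

# A toy dense layer then maps each pooled vector to n_classes numbers:
W = rng.normal(size=(d_model, n_classes))
b = np.zeros(n_classes)
logits = pooled @ W + b
print(logits.shape)  # (64, 10)
```

This is exactly the (64, 300, 512) → (64, 512) → (64, 10) path listed above.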

So the OP’s question was about how we go from (64, 300, 512) to (64, 512).

If you have a specific question, please feel free to ask.


Actually, I couldn’t understand why we use the mean to average over the sequence-length axis specifically. Is it to represent the sequence by the average embedding of all its tokens?

You are correct. It is not entirely clear why they chose this approach (maybe because of the Dense layer at the end, which needs a fixed-size input), but it is one way to represent the whole sequence - by averaging its token embeddings (though you lose important information, such as word order). The same was done in previous weeks when working with RNNs.
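A tiny numpy example of the information loss: mean pooling collapses the sequence axis, so any two sequences containing the same tokens in a different order produce the identical pooled vector (the token values here are made up purely for illustration):

```python
import numpy as np

# A toy "sequence" of 3 token embeddings with d_model = 2.
tokens = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])      # shape (seq_len, d_model)

# The same tokens, reordered - a different "sentence".
shuffled = tokens[[2, 0, 1]]

# Averaging over the sequence axis gives the same result for both:
print(np.allclose(tokens.mean(axis=0), shuffled.mean(axis=0)))  # True
```

So the pooled representation captures *which* embeddings appeared, but not in what order.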