W4 Assignment, Exercise 6: why is the shape after the second Add & Norm layer (batch_size, n_target, fully_connected_dim) and not (batch_size, n_target, d_model)?

So I have two questions.
Q1:
In the Week 4 assignment ‘Transformer Architecture’, Exercise 6, I understand that the shape after the second multi-head attention is (batch_size, n_target, d_model). But how can the shape possibly change to (batch_size, n_target, fully_connected_dim) after the ‘Add & Norm’ layer?

The code I’m referring to is this:

mult_attn_out2, attn_weights_block2 = self.mha2(Q1, enc_output, enc_output, padding_mask, return_attention_scores=True)  # (batch_size, target_seq_len, d_model)

# apply layer normalization (layernorm2) to the sum of the attention output and the output of the first block (~1 line)
mult_attn_out2 = self.layernorm2(mult_attn_out2 + Q1)  # (batch_size, target_seq_len, fully_connected_dim)
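
For reference, here is a minimal, self-contained shape check I put together (a sketch that calls tf.keras.layers.MultiHeadAttention directly, with made-up sizes and without the assignment’s padding mask), showing that the last dimension stays at d_model / embedding_dim through both the attention call and the Add & Norm step:

import tensorflow as tf

# made-up sizes, only for illustrating the shapes
batch_size, target_seq_len, input_seq_len, embedding_dim = 2, 7, 5, 12

mha = tf.keras.layers.MultiHeadAttention(num_heads=3, key_dim=4)
layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-6)

Q1 = tf.random.uniform((batch_size, target_seq_len, embedding_dim))
enc_output = tf.random.uniform((batch_size, input_seq_len, embedding_dim))

# cross-attention: query = Q1, value = key = enc_output
mult_attn_out2, attn_weights_block2 = mha(
    Q1, enc_output, enc_output, return_attention_scores=True
)
print(mult_attn_out2.shape)                  # (2, 7, 12) -- last dim is embedding_dim (d_model)
print(layernorm(mult_attn_out2 + Q1).shape)  # (2, 7, 12) -- Add & Norm does not change it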

Q2:
Also, in class EncoderLayer(tf.keras.layers.Layer), the shape of the encoder output is said to be (batch_size, target_seq_len, d_model), but in the docstring of class DecoderLayer(tf.keras.layers.Layer), the shape of enc_output is said to be (batch_size, n_target, fully_connected_dim). Why is that?

Thanks

Thanks for bringing this up.
The staff have been notified to fix this for the following reasons:

  1. Stacking encoder / decoder layers doesn’t make sense if the final dimensions of the inputs and outputs don’t match.
  2. def FullyConnected ultimately emits embedding_dim in the last dimension (see the sketch after this list).
  3. Layer normalization does not change the shape of its input.
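
To illustrate point 2, here is a sketch of what such a feed-forward block typically looks like (the assignment’s actual FullyConnected may differ in details): the final Dense layer projects back to embedding_dim, which is why the block’s output can be added to its input and fed into the next stacked layer.

import tensorflow as tf

def FullyConnected(embedding_dim, fully_connected_dim):
    # widen to fully_connected_dim, then project back to embedding_dim
    return tf.keras.Sequential([
        tf.keras.layers.Dense(fully_connected_dim, activation='relu'),  # (batch_size, seq_len, fully_connected_dim)
        tf.keras.layers.Dense(embedding_dim),                           # (batch_size, seq_len, embedding_dim)
    ])

ffn = FullyConnected(embedding_dim=12, fully_connected_dim=48)
x = tf.random.uniform((2, 7, 12))
print(ffn(x).shape)  # (2, 7, 12) -- last dimension is embedding_dim again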

Thanks for pointing out this error. That was a good catch. I confirm that the size of the last dimension of the EncoderLayer output is embedding_dim. Somehow we messed up, because we said this: x -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim), which is the source of the confusion. The actual shape of x is (batch_size, input_seq_len, embedding_dim).

I’ll change the documentation and the comments accordingly.
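
In case it helps anyone landing on this thread later, the corrected lines would presumably read along these lines (a sketch of the intent, not the actual updated notebook text):

# EncoderLayer.call docstring:
#     x -- Tensor of shape (batch_size, input_seq_len, embedding_dim)

# DecoderLayer.call, shape comment on the second Add & Norm:
mult_attn_out2 = self.layernorm2(mult_attn_out2 + Q1)  # (batch_size, target_seq_len, embedding_dim)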
