So I have two questions.
Q1:
In the Week 4 assignment ‘Transformer Architecture’, Exercise 6, I understand that the shape after the second MultiHead Attention is (batch_size, n_target, d_model). But how can the shape possibly change to (batch_size, n_target, fully_connected_dim) after the ‘Add & Norm’ layer?
The code I’m referring to is this:
mult_attn_out2, attn_weights_block2 = self.mha2(Q1, enc_output, enc_output, padding_mask, return_attention_scores=True) # (batch_size, target_seq_len, d_model)
# apply layer normalization (layernorm2) to the sum of the attention output and the output of the first block (~1 line)
mult_attn_out2 = self.layernorm2(mult_attn_out2 + Q1) # (batch_size, target_seq_len, fully_connected_dim)
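For context, here is a quick NumPy sketch of what Add & Norm does (standing in for tf.keras.layers.LayerNormalization, ignoring its learned scale and offset, with made-up dimension values): the residual add requires both tensors to have identical shapes, and normalization over the last axis keeps the shape unchanged, so the output should still be (batch_size, target_seq_len, d_model):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last axis, as LayerNormalization does by default
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Hypothetical sizes just for illustration
batch_size, target_seq_len, d_model = 2, 5, 8

mult_attn_out2 = np.random.randn(batch_size, target_seq_len, d_model)
Q1 = np.random.randn(batch_size, target_seq_len, d_model)

# The residual add only works because the shapes match exactly,
# and layer_norm preserves the shape of its input
out = layer_norm(mult_attn_out2 + Q1)
print(out.shape)  # (2, 5, 8), i.e. (batch_size, target_seq_len, d_model)
```

So, as far as I can tell, nothing in this step can change the last dimension, which is why the fully_connected_dim in the comment confuses me.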
Q2:
Also, in class EncoderLayer(tf.keras.layers.Layer), the shape of the Encoder output is said to be (batch_size, target_seq_len, d_model), but in the docstring of class DecoderLayer(tf.keras.layers.Layer), the shape of enc_output is said to be (batch_size, n_target, fully_connected_dim). Why is that?
Thanks