Course 5 - Week 4 - understanding EncoderLayer dimensions

While I was able to get EncoderLayer to pass the tests, I remain confused about what’s going on with the tensor shapes in the guts of the encoder. What the comments say is not what I expect, which I take as a sign that I’m mistaken.

Specifically, after the attention output has been passed through the fully connected layer, the comments in the exercise say that the shape at this point, of ffn_output, should be
(batch_size, input_seq_len, fully_connected_dim)

This confuses me because the sequential FullyConnected layer, defined just above the exercise, has fully_connected_dim activations in the first layer, and then embedding_dim activations in the second and final layer. Doesn’t this mean that the output tensor should have a dimension of embedding_dim in its inner axis?
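
To make the shape argument concrete, here is a minimal NumPy sketch of a two-layer feed-forward block. It is not the assignment's Keras code: the weights W1, b1, W2, b2 and the feed_forward helper are hypothetical stand-ins, and the ReLU on the first layer is an assumption. The sizes match the test cell (embedding_dim = 4, fully_connected_dim = 8).

```python
import numpy as np

embedding_dim = 4
fully_connected_dim = 8

rng = np.random.default_rng(0)
# Hypothetical weights standing in for the two Dense layers.
W1 = rng.normal(size=(embedding_dim, fully_connected_dim))
b1 = np.zeros(fully_connected_dim)
W2 = rng.normal(size=(fully_connected_dim, embedding_dim))
b2 = np.zeros(embedding_dim)

def feed_forward(x):
    # First layer (ReLU assumed): inner axis becomes fully_connected_dim.
    hidden = np.maximum(0.0, x @ W1 + b1)
    # Second layer: inner axis is projected back to embedding_dim.
    return hidden @ W2 + b2

# Same values as the test input q, cast to float.
x = np.array([[[1, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 1]]], dtype=float)
ffn_output = feed_forward(x)
print(ffn_output.shape)  # (1, 3, 4)
```

Only the hidden activation has fully_connected_dim in its inner axis; the block's output is back at embedding_dim.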

Related to this, I was trying to figure out why the input x has dimension fully_connected_dim in its inner axis, as stated in the ‘def call()’ comments. In the test cell below the exercise, I noticed that the test encoder layer is defined with fully_connected_dim = 8, as seen on this line:
encoder_layer1 = EncoderLayer(4, 2, 8)

But then the input tensor is
q = np.array([[[1, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 1]]])
which clearly has dimension 4, not 8, in its inner axis, which happens to be the embedding_dim in the first argument passed to EncoderLayer.

All of this is making me think that the comments are wrong, and that the input x in fact has shape
(batch_size, input_seq_len, embedding_dim), as does ffn_output, which makes far more sense to me.

Admittedly my coding skills are very poor, so I’m checking in here to ask if there’s something that I’m misunderstanding here. I’d like to make sure that I understand every step of the transformer architecture.

Hi Alex,

Thanks for reporting the problem. Yes, you’re right. The output shape of the fully connected layer should be (batch_size, input_seq_len, embedding_dim). I’ll submit a git issue.

In fact, just as you found, the encoder layer input, the MultiHeadAttention layer output, and the fully connected layer input and output MUST all have the same shape, because there are skip connections (similar to ResNet in Course 4) between them, and the tensors being added must match in shape. The encoder layer input x should also have shape (batch_size, input_seq_len, embedding_dim), because for a language model x is a sequence of embedding vectors.
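
As a rough illustration of the skip-connection constraint, here is a small NumPy sketch (plain arrays, not the assignment's Keras layers; the variable names are made up for the example). Adding the input to a sublayer output only works when both have embedding_dim in the inner axis.

```python
import numpy as np

batch_size, input_seq_len = 1, 3
embedding_dim, fully_connected_dim = 4, 8

# Encoder layer input: one embedding vector per token.
x = np.ones((batch_size, input_seq_len, embedding_dim))

# A sublayer output with the correct inner dimension (embedding_dim).
good_sublayer_out = np.zeros((batch_size, input_seq_len, embedding_dim))
# A sublayer output shaped the way the mistaken comment describes.
bad_sublayer_out = np.zeros((batch_size, input_seq_len, fully_connected_dim))

# Skip connection: elementwise add, so the shapes must match.
residual = x + good_sublayer_out
print(residual.shape)  # (1, 3, 4)

try:
    x + bad_sublayer_out
    shapes_compatible = True
except ValueError:
    # Inner axes 4 and 8 cannot be broadcast together.
    shapes_compatible = False
print(shapes_compatible)  # False
```

So if ffn_output really had fully_connected_dim in its inner axis, the residual add after the feed-forward block could not be computed.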


Hi Edward,

Thanks for the clarification! Good to know that it wasn’t me.