In version “v1.6” of the C5_W4_A1_Transformer_Subclass_v1 notebook, there is an error in the comments of Part 4’s coding exercise that is bound to trip people up.
Where Exercise 4 says “pass the output of the multi-head attention layer through a ffn (~1 line)”, it should say “pass the output of the normalized multi-head attention layer through a ffn (~1 line)”.
Maybe that alone would make it clearer, but here’s the comment in the template code for the line immediately before it:
# skip connection
# apply layer normalization on sum of the input and the attention output to get the
# output of the multi-head attention layer (~1 line)
In other words, what they mean by “the output of the multi-head attention layer” is only ambiguous if you missed the point of the previous comment and the diagram in Figure 2a.
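For anyone who still finds the wording confusing, here is a minimal, self-contained sketch of just this step using tf.keras layers. The layer names (mha, layernorm1, ffn) mirror the template’s conventions but are my assumptions here, and the dimensions are illustrative; this is not the assignment’s exact code:

```python
import tensorflow as tf

# Illustrative hyperparameters, not the assignment's values
embedding_dim, num_heads, fully_connected_dim = 12, 2, 32

mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)
layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(fully_connected_dim, activation="relu"),
    tf.keras.layers.Dense(embedding_dim),
])

x = tf.random.uniform((1, 5, embedding_dim))  # (batch, seq_len, embedding_dim)

attn_output = mha(x, x)             # self-attention: query = value = x
out1 = layernorm1(x + attn_output)  # skip connection + layer norm; this normalized
                                    # sum is "the output of the multi-head
                                    # attention layer" the comment refers to
ffn_output = ffn(out1)              # the ffn takes the NORMALIZED output,
                                    # not attn_output
```

The whole point is simply that the ffn consumes out1 (the normalized sum), not the raw attn_output.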
But more clarity is never a bad thing.
…