the comment says
# apply layer normalization on sum of the output from multi-head attention and ffn output to get the
# output of the encoder layer (~1 line)
but if you check the diagram, "the output from multi-head attention" should actually be the output of the norm layer that comes right after the MHA layer. In other words, the second residual connection adds the FFN output to the first LayerNorm's output, not to the raw attention output.
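For reference, the intended dataflow (post-LayerNorm, matching the diagram) can be sketched in plain NumPy. This is only an illustrative sketch: `layer_norm`, `mha`, `ffn`, and `encoder_layer` are stand-in names, not the assignment's exact API.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) axis, as LayerNormalization does.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(x, mha, ffn):
    # mha and ffn are stand-in callables for the attention and
    # feed-forward sublayers (names are illustrative, not the
    # assignment's attributes).
    attn_output = mha(x)
    # First norm layer: applied to the sum of the input and the
    # attention output.
    skip_x_attention = layer_norm(x + attn_output)
    ffn_output = ffn(skip_x_attention)
    # Key point: this residual uses skip_x_attention (the output of
    # the first norm layer), NOT the raw attn_output from MHA.
    return layer_norm(skip_x_attention + ffn_output)
```

The last line is where the comment misleads: the sum fed to the second LayerNorm is the first norm's output plus the FFN output.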
Thanks for putting out the comment. I had the same issue.
It appears the hints in the code comments are not as precise as they could be. This one in particular is worth revising to say "the output of the first layer normalization" rather than "the output from multi-head attention".
Yes, indeed! Thanks for your contribution.
It did strike me when I first wrote that line of code, but I forgot about it and then got stuck on `AssertionError: Wrong values when training=True` while unit-testing the function!
That assertion message is misleading, since I did pass the `training` parameter…