Another improvement suggestion for C5_W4_A1_Transformer_Subclass_v1:

print(tf.keras.activations.softmax(x))
print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9))

should be

print(tf.keras.activations.softmax(x))
print(tf.keras.activations.softmax(x + (1 - tf.squeeze(create_padding_mask(x), axis=1)) * -1.0e9))

so that the mask matches the shape of x. create_padding_mask returns a mask of shape (batch_size, 1, seq_len); the extra axis is there so the mask broadcasts against the attention logits, but x has shape (batch_size, seq_len). Adding the two directly broadcasts the sum up to (batch_size, batch_size, seq_len), so the softmax runs over an unintended tensor. Squeezing out axis 1 keeps everything at (batch_size, seq_len).
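
For context, here is a minimal, self-contained sketch of the shape mismatch. It assumes create_padding_mask follows the assignment's convention (1.0 for real tokens, 0.0 for padding, with an extra broadcast axis); the sample input is illustrative, not copied from the notebook:

import tensorflow as tf

# Assumed definition, following the assignment's convention:
# 1.0 marks real tokens, 0.0 marks padding, and an extra axis is
# inserted so the mask can broadcast against attention logits.
def create_padding_mask(decoder_token_ids):
    seq = 1 - tf.cast(tf.math.equal(decoder_token_ids, 0), tf.float32)
    return seq[:, tf.newaxis, :]  # (batch_size, 1, seq_len)

# Illustrative batch of three zero-padded sequences.
x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])

mask = create_padding_mask(x)
print(x.shape)     # (3, 5)
print(mask.shape)  # (3, 1, 5)

# Without the squeeze, broadcasting silently expands the sum to
# (batch_size, batch_size, seq_len) instead of masking elementwise.
print((x + (1 - mask) * -1.0e9).shape)  # (3, 3, 5)

# With the squeeze, the mask matches x and the result stays (3, 5).
squeezed_mask = tf.squeeze(mask, axis=1)
print((x + (1 - squeezed_mask) * -1.0e9).shape)  # (3, 5)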

Thanks for the suggestion.