Specific to the assignment:
There are some bugs in the assignment and it would be appreciated if the staff did a round of proofreading.
- The padding mask example in 2.1 seems wrong, as it changes the dimension of the mask. Either that is a bug or the mystery should be explained; a padding-mask sketch of what I would expect is included after this list.
- Some of the comments that are most crucial to understanding this week's code have bugs or need polishing. Most significantly, none of the tensors should have any shape dimension equal to fully_connected_dim: that dimension is internal to the fully connected block, and neither the Encoder Layer nor the Decoder Layer code has inputs or intermediate values with fully_connected_dim as the last shape parameter. It should be replaced with embedding_dim. This becomes obvious after some reading, but it is quite disruptive when digesting the code on a first scan.
- Some of the shape comments also use d_model as a parameter; this should probably be embedding_dim as well, since d_model is only defined back in the positional encoding code and embedding_dim seems more natural. The definition of FullyConnected likewise lists d_model and dff as its last shapes; this is the only instance of dff in the entire notebook, so it is probably a typo. The FullyConnected sketch after this list shows the shape comments I would expect.
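For concreteness on the padding mask point, this is roughly the convention I would expect: the mask only gains broadcast dimensions, so it can be combined with the attention logits without changing their rank. This is a minimal sketch in TensorFlow, assuming the usual (batch_size, 1, 1, seq_len) convention; the notebook's actual code may differ.

```python
import tensorflow as tf

def create_padding_mask(seq):
    # seq: (batch_size, seq_len) of token ids, with 0 used for padding
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # Add broadcast dims so the mask can be applied to attention logits of shape
    # (batch_size, num_heads, seq_len_q, seq_len_k) without changing their rank.
    return mask[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

seq = tf.constant([[7, 6, 0, 0, 1],
                   [1, 2, 3, 0, 0]])
print(create_padding_mask(seq).shape)  # (2, 1, 1, 5)
```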
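And on the shape comments, this is roughly what I would expect the FullyConnected helper to look like, with fully_connected_dim appearing only inside the block and embedding_dim as the last dimension of both its input and output. The exact signature and parameter names here are my assumption, not necessarily the notebook's code.

```python
import tensorflow as tf

def FullyConnected(embedding_dim, fully_connected_dim):
    return tf.keras.Sequential([
        # (batch_size, seq_len, embedding_dim) -> (batch_size, seq_len, fully_connected_dim)
        tf.keras.layers.Dense(fully_connected_dim, activation='relu'),
        # (batch_size, seq_len, fully_connected_dim) -> (batch_size, seq_len, embedding_dim)
        tf.keras.layers.Dense(embedding_dim),
    ])

ffn = FullyConnected(embedding_dim=128, fully_connected_dim=512)
x = tf.random.uniform((2, 10, 128))  # (batch_size, seq_len, embedding_dim)
print(ffn(x).shape)                  # (2, 10, 128) -- last dim stays embedding_dim
```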
Lastly, given that the lecture was lacking in details and this assignment really fills in the gaps, it would be nice to include more explanation. For instance, why are the Decoder, and in turn the Transformer, returning attention weights?
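My guess is that the weights are returned mainly so they can be visualized (e.g. plotting which input tokens each output token attends to), but stating that explicitly would help. For reference, the built-in Keras layer exposes the same thing; this snippet is only a stand-in for the notebook's own implementation:

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=64)
q = tf.random.uniform((1, 5, 128))   # (batch, target_seq_len, embedding_dim)
kv = tf.random.uniform((1, 7, 128))  # (batch, input_seq_len, embedding_dim)

# return_attention_scores=True returns the per-head attention weights,
# which is the kind of thing a Decoder/Transformer would pass up for plotting.
out, weights = mha(query=q, value=kv, key=kv, return_attention_scores=True)
print(out.shape)      # (1, 5, 128)
print(weights.shape)  # (1, 2, 5, 7): (batch, num_heads, target_len, input_len)
```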
Could we do something beyond writing the code for the Transformer plus a unit test to end the assignment? A small training job? A walkthrough of a prediction (or, better yet, a training) iteration to see the flow of computation?
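Even a single annotated training step would go a long way. The rough sketch below is all I mean; the tiny stand-in model and random data are placeholders just to keep it self-contained, not the notebook's actual Transformer or dataset.

```python
import tensorflow as tf

# Stand-in model just to make the sketch runnable; in the assignment this
# would be the Transformer built earlier in the notebook.
model = tf.keras.Sequential([tf.keras.layers.Embedding(100, 32),
                             tf.keras.layers.Dense(100)])
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

inp = tf.random.uniform((8, 10), maxval=100, dtype=tf.int32)  # (batch, seq_len) token ids
tar = tf.random.uniform((8, 10), maxval=100, dtype=tf.int32)  # (batch, seq_len) target ids

with tf.GradientTape() as tape:
    logits = model(inp)        # (batch, seq_len, vocab_size)
    loss = loss_fn(tar, logits)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
print(float(loss))
```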
In general, the Week 4 lectures could use more content in terms of 1) some of the details that were skipped and, more importantly, 2) the intuition and ideas behind the design decisions. This is especially true for self-attention. What is actually happening with Q, K, and V? How do they become what they become? Please explain the attention equation beyond, essentially, "it looks kind of like softmax". In particular, what is the difference between K and V? Why do we pass the encoder output twice, as both K and V? Why does the softmax explode if we don't divide by sqrt(d_k), and why by that quantity in particular? This week's coverage does not match the level of Weeks 1–3 or the rest of the specialization, and it would be of great service if it did (especially considering the impact of Transformers today).
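On the sqrt(d_k) question in particular, even a tiny numerical demonstration would help: with roughly unit-variance components, the variance of a dot product q . k grows linearly with d_k, so its standard deviation grows like sqrt(d_k) and the unscaled logits saturate the softmax. Something like this (my own sketch, not from the course materials):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 512):
    q = rng.standard_normal((10000, d_k))
    k = rng.standard_normal((10000, d_k))
    dots = np.sum(q * k, axis=1)  # unscaled dot products q . k
    # With unit-variance components, Var(q . k) = d_k, so the std grows like sqrt(d_k);
    # dividing by sqrt(d_k) keeps the softmax inputs in a sane range.
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
```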
Thank you.