Pedagogy of C5 W4 A1 Transformer (Ex4: Encoder Layer)

Hi, I just want to start by saying how impressed I am with the pedagogy overall throughout all four classes. Concepts really build one on top of the other, and all challenges (and make no mistake, challenges are necessary for learning) really find the right level: they test students without going past their zone of proximal development. As a former high school math teacher who earned an EdD at Harvard’s Graduate School of Education, I don’t hand out compliments for pedagogy to just anyone; it is well deserved here.

I have to say, however, that the current state of Course 5 Week 4 lags behind the standards set by all of the previous lessons, activities, and courses to date. There are many, many conceptual jumps (both in the videos and in the assignment) that, in my estimation, are well beyond most students’ zone of proximal development. The result is that, even if students eventually “figure out the answer”, I don’t think most will really gain the same level of understanding of the concepts that the other lessons provide.

One prime example of this is Exercise 4 (“EncoderLayer”) in the only assignment for the week (Assignment 1). I was able to get to the correct answer by using the guide TensorFlow puts out on their website for Transformers. But the syntax, honestly, makes zero sense to someone like myself who has a coding background and has been able to follow all of the assignments to this point, but is not yet an expert in Python. In other words, I have no basis for understanding the syntax, and the one bullet point in the assignment that addresses the __init__ method is far from sufficient to help me understand the coding paradigm at work here. And it is this level of understanding that I am after, not some certificate.
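For other students stuck on the same point: as far as I can tell, the coding paradigm in question is Keras layer subclassing, where __init__ creates and stores the sub-layers and call chains them together in the forward pass. Here is a framework-free sketch of that pattern that I put together (the names are my own, not the graded code; in real Keras you implement call and the base class’s __call__ dispatches to it):

```python
class Scale:
    """Stand-in for a Keras layer: configured once in __init__, then called like a function."""
    def __init__(self, factor):
        self.factor = factor                  # configuration/state stored at construction time

    def __call__(self, x):
        return [v * self.factor for v in x]   # calling the object runs the computation


class TinyEncoderLayer:
    """Mimics the structure of the EncoderLayer exercise, minus TensorFlow."""
    def __init__(self):
        # Sub-layers are created ONCE here, so their state persists across calls.
        self.ffn = Scale(2)

    def __call__(self, x):
        # The forward pass just chains the sub-layers that __init__ stored.
        return self.ffn(x)


layer = TinyEncoderLayer()
print(layer([1, 2, 3]))   # -> [2, 4, 6]
```

Once that pattern clicked, the EncoderLayer skeleton read much more naturally to me.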

I say all of this because I think a lot of students are going to be frustrated with this week’s materials, not least because this is the last week of the course, yet this week’s lessons and assignment break so many conventions and expectations set in all of the previous courses.

Will committed students get to the correct answer? Many will find the resources online to arrive at an answer, sure. But will they really understand what is going on? No, not unless they already have a strong background in the material, a background that is not assumed in any of the other courses and lessons.

Bottom line: more scaffolding is sorely needed to match the best practices in pedagogy that manifest in all of the other lessons.

Some suggestions:

a) Videos are shorter this week, so there is time available to give students extra preparation for the lab.
b) In other labs there is example syntax to help scaffold concepts before students go off and code on their own. There are ways of doing this that scaffold the concept without “giving away the answer”.
c) Simply acknowledging that this week’s lesson is not like the others would help students plan accordingly and devote more time to it than they might reasonably expect. Having taught graduate-level courses, it is well understood that the “flunk you” assignment is supposed to come at the beginning of a course, not at the end, especially when it may mean the student has to pay for an unplanned extra month to earn a certificate.


Completely agree with @marcus-waldman. For anyone struggling, Transformer model for language understanding | TensorFlow Core was really helpful.


@marcus-waldman your message has generated a lot of discussion in the mentors area and with the course staff (as an example we held an hour-long session with over 15 attendees). The staff are listening to the problems.

Thank you very much for expressing the issues so clearly and constructively.


Just sharing my personal experience. I took the DLS a few months back when it was still based on TF1 and found the parts using TF and Keras difficult to follow as well. Since then I took the TensorFlow 2 for Deep Learning Specialization by Imperial College London, which is also on Coursera. Now I am coming back to DLS for the refreshed material, i.e. the new network architectures in the content, and I found it much easier to follow thanks to the background I got from the TF2 course. Not sure it is the answer you are looking for, but it is one way to get the understanding you are seeking.

P.S. While the content in the TF2 course is pretty good, there are some issues with the assignments, and the support you get from the course creators is lacking, unlike the DLS, where you have multiple mentors answering questions. However, just following the content was sufficient for me to gain the required understanding.


Can also agree with @marcus-waldman. As a Python programmer with less experience in TensorFlow, I found this week’s programming exercise difficult. Compared to the previous programming exercises, the tasks are insufficiently described. To be honest, I’m not sure I would have made it through the exercise without the reference to the TensorFlow help (thanks to @lmmontoya4490). It’s time for a TensorFlow course …


Couldn’t agree more. This last assignment is purely TensorFlow and Keras, so now I know I should’ve spent my money on a TensorFlow/Keras course. It’s just a shame compared with the rest of the DLS.


Absolutely spot-on, @marcus-waldman. It looks like the last part of the course (with its associated assignment) was made in a hurry to pack in some recent developments. The material on transformers is covered too superficially to gain any meaningful understanding of the concepts; I had to read through four or five blog posts to understand it. Maybe a compromise could be made to cover less ground but cover that ground more firmly.


Spot on. Beautifully put, and in a very constructive way, too. :slight_smile:

To be honest, I’m not sure how this last exercise could be released in its current form. It is a lot more difficult than any of the other assignments and, as said above, it is far beyond the zone of proximal development established so far in this great course.


I think this Transformer material could even be a whole four-week course on its own. It was missing from the original specialization for a long time and was sort of “padded on” to make the course more relevant, since I heard LSTMs/RNNs have really fallen out of favor for NLP. So I am totally not surprised at some of the criticism levelled here.

But I still like A. Ng’s heavy emphasis on high-level intuition and understanding. I think a week on Transformers is by no means sufficient, but it is a good foundational background for when you read other papers, tutorials, etc. I agree the prerequisites and experience required are a step up. I almost have the feeling it is designed for those who completed this specialization two years ago, accumulated more knowledge, and came back to do the Transformer material.


There are now three additional ungraded labs added to Week 4.


Spot on and can’t agree more.

I kept thinking to myself: how could this last lab be so different (in terms of difficulty and clarity of instructions) from all the prior lessons? I appreciate the effort to incorporate the most up-to-date concepts into the course, perhaps as a stepping stone to other specializations or courses that go into further detail. However, if the purpose is to stay conceptual in this last week, as the length of the videos suggests, the assignment should serve the same purpose.


Hi TMosh, will the ungraded labs help to understand assignment 1? At the moment I am quite frustrated trying to understand this assignment…



This lab appears to be a copy of the free material at Transformer model for language understanding | Text | TensorFlow, with some helpful details and explanations omitted.


Thanks for this post. I’m really struggling with this lab as well, and I totally identify with your points.


One more vote here!

I’m coming to this course after a long background in math and python, and after having just completed the original “Intro to Machine Learning” course by Andrew on Coursera.

For what it’s worth, I’ve been sailing through the entire Deep Learning Specialization until hitting this last week of Course 5.

Some questions that I was left with after the videos:

  • Where the heck do the Q/K/V matrices come from?
  • What are the dimensions of the self-attention layers?
  • Why do Q/K dot products “answer questions” about the words?
  • Why are we multiplying by V?

I’ve taken many pages of detailed notes on the entire DLS, but on the “Full Transformer Network” architecture section, I could only write “Go through ‘Attention is All You Need’ (2017) in detail to figure this out”.
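In case it helps anyone with the same questions, here is my current understanding condensed into a NumPy sketch (my own hedged reading of the lectures and the paper, not code from the assignment): Q, K, and V are all learned linear projections of the same input X; the Q·K dot products score how well each token’s “query” matches each other token’s “key”; and multiplying by V mixes the content of the well-matched tokens into each output vector:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_q, seq_k): query/key match scores
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # (seq_q, d_v): weighted mix of value rows

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))          # 4 tokens, embedding dimension 8
# In self-attention, Q, K, V all come from the SAME input X via learned weights:
W_q, W_k, W_v = rng.normal(size=(3, 8, 8))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                     # (4, 8): one output vector per input token
```

So the answer to “where do Q/K/V come from?” seems to be: from three trainable weight matrices applied to the same embeddings.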

And also +1 to the comments about the programming assignment. It’s horrific. The errors are inscrutable unless you are a TF expert, and you end up either blindly plugging things into very sparsely explained functions or, as I suspect the vast majority (including myself!) did, following the TF Transformer tutorial, of which this exercise is a copy-and-paste with slight adjustments.


I was filled with admiration and appreciation for the way the previous programming exercises were put together and augmented my understanding of the material. I especially appreciated the improved unit-testing that was introduced - that was fantastic.

This programming exercise - you threw your students into the deep end with this one. Really a sour note to finish a 17-week commitment that involved giving up a lot of my weekends.

Thank you to the original poster for doing such a good job articulating the challenges and differences. And thanks to all involved in these classes for the great work. Overall a fantastic journey. Please - fix this last assignment so your students can finish out on a good note.


I could not have stated this better @marcus-waldman .

All, if you struggle, and want to really understand transformers, here is the best resource I could find: Transformers from scratch.

This article is an absolute wonder of pedagogy.

That’s what I would have expected from a deep learning course from Andrew Ng and his team.


Totally Agree. Well said!


I’d also agree with Marcus’ point, even with the current (Aug 21st) version of Course 5, Week 4. The Transformer assignment seems to set a quite different level of expectation on the student. I was able to work through most of the implementation and could infer most of what was required from the very brief, and occasionally incomplete, comments (e.g. “final_output --describe me” certainly looks like a placeholder).

My background is in software development (BEng, MSc, PhD) and I’ve taught at university level, and there was certainly a very notable difference in how much was left to the student, who has to make some pretty significant leaps. I also felt the transformer lectures were quite brief and expected a great deal of prior understanding; much of the material was not explained in very much detail compared to the rest of the course.

Overall this course and the whole specialization are excellent; it just seems this final piece is a bit rough/rushed, even with the changes that have been attempted.


Couldn’t agree more!

I love this DL specialisation; it is so great that even someone like me, who has barely any background in programming, can write a simple deep learning program of my own. I would never have believed I could do this before I took this course; programming and deep learning sounded super difficult to me back then.

However, I was just so confused when working on the assignment for this last week. I got from the lecture why the Transformer Network is a big breakthrough in the NLP field, along with some of its details. But when it came to the programming assignment, I started to feel lost from the second exercise onward.

What is init and what is call? Did I miss something from the lecture?
Why should I write ‘self.mha(x,x,x,mask)’ instead of ‘self.mha(x)’? I got this line from the forum after hours of searching to understand what went wrong in my code.
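For anyone with the same confusion, here is what eventually made it click for me (a NumPy sketch of my own, not the assignment’s code): an attention layer takes query, key, and value as three separate arguments so the very same layer can also do cross-attention, where the keys and values come from a different sequence. Self-attention is just the special case where one tensor plays all three roles, which is why x is passed three times:

```python
import numpy as np

def attention(q, k, v, mask=None):
    """Generic attention: q asks, k is matched against, v supplies the content."""
    scores = q @ k.T / np.sqrt(k.shape[-1])     # (len_q, len_k)
    if mask is not None:
        # My own convention here (True = keep); the assignment may use a different one.
        scores = np.where(mask, scores, -1e9)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)          # softmax over the keys
    return w @ v

x = np.eye(3)                        # a toy 3-token sequence, embedding dim 3
enc = np.ones((5, 3))                # a toy 5-token "encoder output"

self_attn = attention(x, x, x)       # self-attention: one tensor plays all three roles
cross_attn = attention(x, enc, enc)  # cross-attention: keys/values come from elsewhere
print(self_attn.shape, cross_attn.shape)   # (3, 3) (3, 3)
```

So ‘self.mha(x,x,x,mask)’ is the layer being told: query with x, match against x, and read content from x.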

Maybe it would be better to modify the last assignment to be easier but still fun, like the assignments in the earlier weeks, for example Dinosaurus_Island and Improvise_a_Jazz_Solo. That would show us how we could implement this Transformer Network, and it is easier for us students to understand when it is based on a practical example.

Thank you to all the tutors and Professor A. Ng. This is a great course!