Course 3 Module 3 Quiz 1 Question 3 frames the problem incorrectly

The quiz implies that parallelism is the reason transformers need positional encodings.
This is misleading and not technically accurate.

Transformers need positional information because self-attention is permutation-invariant, not because computation is parallel.

  • Self-attention computes interactions among tokens without regard to order unless you explicitly inject order.

  • Even if the attention mechanism were computed serially, it would still be permutation-invariant, so you would still need to inject positional information.

  • RNNs preserve order not because they are non-parallel but because their state transition function encodes order by design.

So the real reason is architectural, not procedural.
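To make the point concrete, here is a minimal sketch (plain NumPy, single-head attention, random weights; all names are my own for illustration) showing that self-attention without positional information is permutation-equivariant: permuting the input tokens just permutes the output rows, so any order-insensitive pooling gives an identical result.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention, no positions injected.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

d = 8
X = rng.normal(size=(5, d))                # 5 tokens, order not encoded anywhere
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(5)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling the tokens merely shuffles the output rows the same way...
assert np.allclose(out[perm], out_perm)
# ...so an order-insensitive summary (e.g. mean pooling) cannot tell the orders apart.
assert np.allclose(out.mean(axis=0), out_perm.mean(axis=0))
```

Nothing here depends on whether the matrix products are evaluated in parallel or one row at a time; the invariance is a property of the math, not of the execution schedule.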

Even though the marked choice is the one closest to the truth, I would suggest clearer phrasing, along the lines of:
“Transformers require positional encodings because the attention mechanism is inherently permutation-invariant; without them, swapping token order yields identical representations.”
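And here is the complement of the demo above, sketched the same way (the standard sinusoidal encodings from "Attention Is All You Need"; function names are my own): once positional encodings are added to the slots, swapping token order no longer yields the same representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W):
    # Single-head scaled dot-product self-attention.
    Q, K, V = X @ W[0], X @ W[1], X @ W[2]
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def sinusoidal_pe(n, d):
    # Sinusoidal positional encodings: sin on even dims, cos on odd dims.
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

n, d = 5, 8
X = rng.normal(size=(n, d))
W = rng.normal(size=(3, d, d))
perm = np.array([4, 2, 0, 3, 1])

pe = sinusoidal_pe(n, d)
# The encodings are attached to positions, not to tokens, so a
# reordered sequence now produces genuinely different representations.
out = self_attention(X + pe, W)
out_perm = self_attention(X[perm] + pe, W)
assert not np.allclose(out[perm], out_perm)
```

This is exactly the "inject order" step the suggested phrasing refers to: the encodings break the permutation symmetry that attention has by construction.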

If I remember correctly, one of the course videos also ambiguously implied that "parallelism" was to blame (and it did not mention permutation invariance), but in passing that felt acceptable. The quiz answer and its explanation, on the other hand, might mislead someone into thinking the whole problem is about "parallel processing".

Cheers

P.S. I really like the course :slightly_smiling_face: one of the best :raising_hands:


By the way, the quiz question itself is also flawed. Even though my intuition was to go with the second answer, the third option is also correct:

Out of curiosity I selected the third option, and it was marked as incorrect.

The explanation given in the quiz is wrong (or at least oversimplified), and option 3 should not be marked incorrect.

Allowing "different heads to focus on different parts" is not the same thing as "dividing the sequence into parts", so the quiz's rebuttal ("It's not about dividing the sequence into parts") does not actually refute option 3. The explanation is simply wrong, or at least phrased inaccurately.

Yes, every head can attend to all positions, but in practice heads typically don't attend uniformly; the loss function and training dynamics encourage them to specialize. While each head can attend to the full sequence, heads tend to focus on different tokens because of their distinct learning trajectories, and this is exactly how they "learn different kinds of relationships", as option 2 puts it.

In short, options 2 and 3 are both valid, and the quiz's explanation for the third choice rests on a somewhat distorted argument.


@mubsi, can you look into this?

Sorry, I forgot to acknowledge this. Yes, I did see it, and it has been forwarded to the team to look into.
