Course 3 Module 3 Quiz 1 Question 3 frames the problem incorrectly

The quiz implies that parallelism is the reason transformers need positional encodings.
This is misleading and not technically accurate.

Transformers need positional information because self-attention is permutation-invariant, not because computation is parallel.

  • Self-attention computes interactions among tokens without regard to order unless you explicitly inject order.

  • Even if the attention mechanism were computed serially, it would still be permutation-invariant, so you would still need to inject positional information.

  • RNNs preserve order not because they are non-parallel but because their state transition function encodes order by design.

So the real reason is architectural, not procedural.
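To make the point concrete, here is a minimal sketch (plain NumPy, single-head attention, random weights; all names are my own for illustration) showing that self-attention without positional information is permutation-equivariant: permuting the input tokens just permutes the output rows, so any order-insensitive pooling gives an identical result.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention, no positions injected.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

d = 8
X = rng.normal(size=(5, d))                # 5 tokens, order not encoded anywhere
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(5)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling the tokens merely shuffles the output rows the same way...
assert np.allclose(out[perm], out_perm)
# ...so an order-insensitive summary (e.g. mean pooling) cannot tell the orders apart.
assert np.allclose(out.mean(axis=0), out_perm.mean(axis=0))
```

Nothing here depends on whether the matrix products are evaluated in parallel or one row at a time; the invariance is a property of the math, not of the execution schedule.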

Even though the marked choice is the one closest to the truth, I would suggest clearer phrasing, along the lines of:
“Transformers require positional encodings because the attention mechanism is inherently permutation-invariant; without them, swapping token order yields identical representations.”
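And here is the complement of the demo above, sketched the same way (the standard sinusoidal encodings from "Attention Is All You Need"; function names are my own): once positional encodings are added to the slots, swapping token order no longer yields the same representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W):
    # Single-head scaled dot-product self-attention.
    Q, K, V = X @ W[0], X @ W[1], X @ W[2]
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def sinusoidal_pe(n, d):
    # Sinusoidal positional encodings: sin on even dims, cos on odd dims.
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

n, d = 5, 8
X = rng.normal(size=(n, d))
W = rng.normal(size=(3, d, d))
perm = np.array([4, 2, 0, 3, 1])

pe = sinusoidal_pe(n, d)
# The encodings are attached to positions, not to tokens, so a
# reordered sequence now produces genuinely different representations.
out = self_attention(X + pe, W)
out_perm = self_attention(X[perm] + pe, W)
assert not np.allclose(out[perm], out_perm)
```

This is exactly the "inject order" step the suggested phrasing refers to: the encodings break the permutation symmetry that attention has by construction.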

If I remember correctly, one of the course videos also ambiguously implied that "parallelism" was to blame (and it did not mention permutation invariance), but in passing that felt acceptable. The quiz answer and its explanation, on the other hand, might mislead someone into thinking the whole problem is about "parallel processing".

Cheers

P.S. I really like the course :slightly_smiling_face: one of the best :raising_hands:


By the way, the quiz question itself is also flawed. Even though my intuition was to go with the second answer, the third option is also correct:

Out of curiosity I selected the third option, and it was marked as incorrect.

The explanation given in the quiz is wrong (or at least oversimplified), and option 3 should not be marked incorrect.

Allowing "different heads to focus on different parts" is not the same thing as "dividing the sequence into parts", so the quiz's rebuttal ("It's not about dividing the sequence into parts") does not actually refute option 3. The explanation is simply wrong, or at least phrased inaccurately.

Yes, every head can attend to all positions, but in practice heads typically don't attend uniformly; the loss function and training dynamics encourage them to specialize. While each head can attend to the full sequence, heads tend to focus on different tokens because of their distinct learning trajectories, and this is exactly how they "learn different kinds of relationships", as option 2 puts it.

In short, options 2 and 3 are both valid, and the quiz's explanation for the third choice rests on a somewhat distorted argument.


@mubsi, can you look into this?

Sorry, I forgot to acknowledge this. Yes, I did see it, and it has been forwarded to the team to look into.
