Transformer parallelism

Transformers apparently do everything in parallel, but it looks like the decoder still works like an RNN: it has to feed the previous predictions back in to generate the next word in the sequence each time. That part isn't done in parallel, right?

Correct. The decoder layers are stacked, but the whole stack has to be re-run for each new token: at generation time decoding is autoregressive, so every output token depends on the tokens already produced, and there is no parallelism across output positions. The parallelism you're thinking of applies during training, where the full target sequence is already known and a causal mask lets all positions be processed in a single pass.
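
A minimal sketch of that sequential generation loop, assuming a hypothetical encoder-decoder `model` whose forward pass takes the source tokens and the target tokens generated so far and returns logits of shape (batch, target_length, vocab). The names `model`, `bos_id`, and `eos_id` are placeholders, not something from the original exchange:

```python
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    # Assumed interface: model(src, tgt) -> logits with shape (batch, tgt_len, vocab).
    batch_size = src.size(0)
    # Start every sequence with the beginning-of-sequence token.
    generated = torch.full((batch_size, 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        # Re-run the decoder stack on all tokens generated so far.
        logits = model(src, generated)
        # Greedily pick the most likely next token from the last position.
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # Append it and feed the extended sequence back in on the next step.
        generated = torch.cat([generated, next_token], dim=1)
        if (next_token == eos_id).all():
            break
    return generated
```

Each iteration depends on the token chosen in the previous one, which is why decoding time grows with output length. In practice a key/value cache avoids recomputing attention for past positions, but the token-by-token dependency itself remains.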