C3M3_Lab_3_decoder: decoder implementation

I am trying to understand the Decoder class implementation in this lab (section 3.4). My understanding from the Decoders video is that the nn.TransformerDecoderLayer and nn.TransformerDecoder modules are designed for encoder-decoder models, and that to implement a decoder-only model one uses an Encoder with a causal mask.
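To make my understanding concrete, here is a minimal sketch (my own code, not the lab's; dimensions are placeholders) of the decoder-only pattern I expected: a TransformerEncoder stack plus a causal mask.

```python
import torch
import torch.nn as nn

# Placeholder dimensions for illustration
d_model, nhead, num_layers, seq_len = 64, 4, 2, 10

# A plain encoder layer: self-attention only, no cross-attention
enc_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=nhead, batch_first=True, norm_first=True
)
decoder_only = nn.TransformerEncoder(enc_layer, num_layers)

# Causal mask: position i may only attend to positions <= i
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

x = torch.randn(2, seq_len, d_model)          # [batch, seq, features]
out = decoder_only(x, mask=causal_mask)
print(out.shape)  # torch.Size([2, 10, 64])
```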

Furthermore, the description in the lab's section 3.4 reads

Now let’s build a complete text generator using PyTorch’s optimized TransformerEncoder layers configured for autoregressive generation.

and

TransformerEncoder as Decoder: We use PyTorch’s TransformerEncoder with causal masking, which effectively creates a decoder

However, the implementation in the same section looks like this

# === DECODER ARCHITECTURE ===
# Use TransformerDecoderLayer 
# This has both self-attention AND cross-attention capabilities
dec_layer = nn.TransformerDecoderLayer(
    d_model=d_model,
    nhead=nhead,
    dim_feedforward=dim_feedforward,
    dropout=dropout,
    batch_first=True,    # Input format: [batch, seq, features]
    norm_first=True      # Pre-norm for training stability
)

# Stack multiple decoder layers to create deep network
# Each layer refines the representation further
self.transformer_decoder = nn.TransformerDecoder(dec_layer, num_layers)

Am I misunderstanding something? Does using nn.TransformerDecoder here still result in a valid decoder-only model?
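Part of my confusion is that nn.TransformerDecoder's forward signature requires a `memory` tensor for cross-attention, which a decoder-only model has no encoder to produce. A sketch (again my own code, placeholder dimensions) of what calling the quoted stack would look like:

```python
import torch
import torch.nn as nn

# Placeholder dimensions for illustration
d_model, nhead, seq_len = 64, 4, 10

dec_layer = nn.TransformerDecoderLayer(
    d_model=d_model, nhead=nhead, batch_first=True, norm_first=True
)
decoder = nn.TransformerDecoder(dec_layer, num_layers=2)

tgt = torch.randn(2, seq_len, d_model)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

# `memory` is a required argument; with no encoder present, something
# must be passed for the cross-attention to attend over (here I pass
# the target itself just to make the call run).
out = decoder(tgt, memory=tgt, tgt_mask=causal_mask)
print(out.shape)  # torch.Size([2, 10, 64])
```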

I think the description is incorrect.

@Mubsi if you are available, can you please look into this?

@ccaloian please note that staff are on holiday until 2 Jan, so responses to issues might be delayed.

Regards
DP