I am trying to understand the Decoder class implementation in this lab (section 3.4). My understanding from the Decoders video is that the nn.TransformerDecoderLayer and nn.TransformerDecoder modules are designed for encoder-decoder models, and that to implement a decoder-only model one uses an Encoder with a causal mask.
Furthermore, the description in the lab’s section 3.4 reads:
Now let’s build a complete text generator using PyTorch’s optimized TransformerEncoder layers configured for autoregressive generation.
and
TransformerEncoder as Decoder: We use PyTorch’s TransformerEncoder with causal masking, which effectively creates a decoder
However, the implementation in the same section looks like this:
# === DECODER ARCHITECTURE ===
# Use TransformerDecoderLayer
# This has both self-attention AND cross-attention capabilities
dec_layer = nn.TransformerDecoderLayer(
    d_model=d_model,
    nhead=nhead,
    dim_feedforward=dim_feedforward,
    dropout=dropout,
    batch_first=True,  # Input format: [batch, seq, features]
    norm_first=True    # Pre-norm for training stability
)
# Stack multiple decoder layers to create a deep network
# Each layer refines the representation further
self.transformer_decoder = nn.TransformerDecoder(dec_layer, num_layers)
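To check the call-time requirements myself, I ran a minimal standalone sketch of the lab's construction (the dimensions are placeholder values I picked, not the lab's). Note that nn.TransformerDecoder's forward signature requires a memory tensor for cross-attention, which is what prompted my confusion:

```python
import torch
import torch.nn as nn

# Placeholder hyperparameters (not from the lab)
d_model, nhead, num_layers = 64, 4, 2

dec_layer = nn.TransformerDecoderLayer(
    d_model=d_model, nhead=nhead, batch_first=True, norm_first=True
)
decoder = nn.TransformerDecoder(dec_layer, num_layers)

tgt = torch.randn(2, 10, d_model)  # [batch, seq, features]
causal_mask = nn.Transformer.generate_square_subsequent_mask(10)

# forward() requires `memory` (the encoder output in an encoder-decoder
# setup); there is no encoder here, so something must be passed anyway
out = decoder(tgt, memory=tgt, tgt_mask=causal_mask)
assert out.shape == (2, 10, d_model)
```

So the module runs, but only because some tensor is supplied as memory, and the cross-attention layers then attend to it.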
Am I misunderstanding something? Does using nn.TransformerDecoder here result in a valid decoder-only model?
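For reference, this is the decoder-only construction I understood the lab's prose to describe: nn.TransformerEncoder with a causal mask, which has self-attention only and no cross-attention (again with placeholder dimensions of my own):

```python
import torch
import torch.nn as nn

# Placeholder hyperparameters (not from the lab)
d_model, nhead, num_layers = 64, 4, 2

enc_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=nhead, batch_first=True, norm_first=True
)
decoder_only = nn.TransformerEncoder(enc_layer, num_layers)

x = torch.randn(2, 10, d_model)  # [batch, seq, features]
causal_mask = nn.Transformer.generate_square_subsequent_mask(10)

# Self-attention only; the causal mask prevents attending to future tokens
out = decoder_only(x, mask=causal_mask)
assert out.shape == (2, 10, d_model)
```

This matches the "TransformerEncoder as Decoder" description in the section text, which is why the TransformerDecoder-based code surprised me.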