Week 4: Attention Is All You Need


I don’t know if anyone can or wants to help, but I’m updating the “Attention Is All You Need” Jupyter notebook from the paper’s GitHub repo to a modern version of PyTorch. Using the course notebook above as a reference, I was able to update and compile the model in the paper’s notebook, but the course notebook doesn’t show how to interpret the transformer network’s output.

The “Attention Is All You Need” notebook runs the decoder repeatedly to generate each output token:

trg_indexes = [trg_field["<bos>"]]

for i in range(max_len):
    trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)
    trg_mask = model.make_trg_mask(trg_tensor)

    with torch.no_grad():
        output, attention = model.decoder(trg_tensor, enc_src, trg_mask, src_mask)

    # greedily take the most likely token at the last position
    pred_token = output.argmax(2)[:, -1].item()
    trg_indexes.append(pred_token)

    if pred_token == trg_field["<eos>"]:
        break

but all I’m getting is:

predicted sentence= ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']

Any thoughts what might be causing this?

Hi @ajacobvitz

Great work! However, you need to make sure the model has been trained long enough, and check the loss values during training. Also, verify that the encoder and decoder are working correctly by checking their inputs and outputs.

For debugging, try printing intermediate output values during decoding to understand what predictions are being made at each step (see the sketch below). Also, reduce max_len to a smaller value and inspect the predictions step by step. If possible, compare outputs and intermediate states with a reference implementation (e.g., on a simple input-output pair) to identify the problem.
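For example, inside the decoding loop you could print the top few candidate tokens at each step. A minimal, self-contained sketch with toy shapes (itos is a placeholder for however your vocab maps indices back to tokens, and output stands in for the decoder output):

import torch

# toy stand-ins: output has the decoder's shape [batch, trg_len, vocab]
vocab_size = 8
itos = [f"tok{j}" for j in range(vocab_size)]  # placeholder id -> token lookup
output = torch.randn(1, 3, vocab_size)

# softmax over the vocab dimension at the last target position
probs = torch.softmax(output[0, -1], dim=-1)
top_p, top_i = probs.topk(5)
print(", ".join(f"{itos[j]} ({p:.3f})"
                for p, j in zip(top_p.tolist(), top_i.tolist())))

If almost all of the probability mass lands on the same token at every step, the problem is upstream of decoding (training or the loss), not in the generation loop.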

Hope it helps! Feel free to ask if you need further assistance.

The transformer starts out predicting random words and then converges toward predicting “a”, “.”, “\n”, and “” for all positions (I changed the batch size to 1; see the output below). This makes me think something is wrong with the loss function. I tried just feeding the loss function random inputs, but PyTorch tensors have a “grad_fn” field that the model fills in during the forward pass, which I assume is used for backpropagation, and I don’t know how to fill in that field manually.

I’m using:

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX).to(device)

which supposedly skips over pad indices; this line is from the original notebook.

@Alireza_Saei, do you know if there’s a way to inspect how the loss function is computed in PyTorch? The documentation has equations, but I’m not sure which dimensions each computation runs over.
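For reference, here is how I understand the criterion to be applied, with a manual re-computation I’d expect to match it. This is a sketch with toy shapes; the flattening of batch and time and the pad masking are my assumptions about what CrossEntropyLoss with ignore_index does, not code from the notebook:

import torch
import torch.nn as nn
import torch.nn.functional as F

batch, trg_len, vocab, TRG_PAD_IDX = 2, 5, 10, 1  # toy sizes
output = torch.randn(batch, trg_len, vocab)       # [batch, trg_len, vocab] logits
trg = torch.randint(0, vocab, (batch, trg_len))   # [batch, trg_len] token indices

criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)

# the class (vocab) dimension must come second, so batch and time are flattened
loss = criterion(output.view(-1, vocab), trg.view(-1))

# manual version: log-softmax over the vocab dimension, then average the
# negative log-probability of each gold token over the non-pad positions
log_probs = F.log_softmax(output.view(-1, vocab), dim=-1)
gold = trg.view(-1)
nll = -log_probs[torch.arange(gold.numel()), gold]
manual = nll[gold != TRG_PAD_IDX].mean()

print(loss.item(), manual.item())  # these should agree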

The input SRC/TRG pairs for training look ok:

SRC:  ['<bos>', 'Ein', 'kleines', 'Mädchen', 'mit', 'einem', 'rosa', 'Würfel', 'in', 'ihren', 'braunen', 'Haaren', 'macht', 'ein', 'trauriges', 'Gesicht', '.', '\n', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
TRG:  ['<bos>', 'A', 'young', 'girl', 'with', 'pink', 'dice', 'in', 'her', 'brown', 'hair', 'has', 'a', 'sad', 'look', 'on', 'her', 'face', '.', '\n', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']

This is the output from the transformer after each training input pair:

EPOCH 1, start time 1732144480.1517425
['pocket', 'campground', 'relaxes', 'Times', 'Code', 'Princess', 'ballons', 'Teens', 'referencing', 'snowball', 'from', 'coastline', 'coastline', 'Times', 'Times', 'attentively', 'blacked', 'Teens', 'Times', 'Times', 'batch', 'pocket', 'coastline', 'tastes', 'Times', 'Teens', 'waterskis', 'Times', 'Times', 'Times', 'pocket', 'drawer', 'Toyota', 'pocket', 'pocket', 'Teens', 'coastline', 'snowball', 'pocket', 'pocket', 'tastes', 'knit', 'steel', 'Times', 'Code', 'snowball', 'Family', 'Toyota', 'Tartuffe', 'snowball', 'drawer', 'Times', 'Times', 'Times', 'attentively', 'ashtray', 'nose', 'Times', 'pocket', 'snowball', 'warmly', 'steel', 'ballons', 'perked', 'ha', 'Toyota', 'ballons', 'Code', 'Times', 'corks', 'Mrs.', 'Teens', 'Times', 'warmly', 'Times', 'ha', 'Costco', 'ha', 'Times', 'drawer', 'pocket', 'laughs', 'perked', 'leaking', 'Times', 'camera', 'steel', 'Code', 'Times', 'Times', 'Times', 'Machine', 'Artwork', 'Times', 'attentively', 'waterskis', 'maid', 'Times', 'Times', 'Times']
loss tensor(9.2950, device='cuda:0', grad_fn=<NllLossBackward0>)
['older', 'ethnicity', 'older', 'older', 'with', 'competitive', 'paperwork', 'At', 'shoe', 'older', 'older', 'older', 'competitive', 'darkly', 'Average', 'older', 'shoe', 'projection', 'contently', 'older', 'free', 'formally', 'skill', 'older', 'lesbians', 'older', 'Vegas', 'older', 'older', 'ethnicity', 'older', 'formally', 'older', 'competitive', 'Vegas', 'older', 'ethnicity', 'older', 'older', 'arrival', 'competitive', 'older', 'older', 'lesbians', 'older', 'lesbians', 'knit', 'ethnicity', 'ethnicity', 'guides', 'shoe', 'formally', 'older', 'formally', 'shoe', 'Inside', 'older', 'plucking', 'Derby', 'lesbians', 'older', 'older', 'paperwork', 'older', 'older', 'older', 'lesbians', 'older', 'older', 'competitive', 'pocket', 'paddling', 'projection', 'Derby', 'older', 'jockeys', 'older', 'older', 'ethnicity', 'older', 'older', 'flower', 'dayglo', 'competitive', 'formally', 'capital', 'older', 'shoe', 'older', 'older', 'shoe', 'projection', 'older', 'competitive', 'Machine', 'paddling', 'older', 'Average', 'older', 'older']
loss tensor(9.2938, device='cuda:0', grad_fn=<NllLossBackward0>)
['a', 'a', 'older', 'older', 'projection', 'a', 'older', 'with', 'older', 'a', 'a', 'a', 'a', 'a', 'a', 'older', 'older', 'a', 'a', 'applause', 'older', 'older', 'active', 'a', 'older', 'older', 'bathroom', 'older', 'a', 'Vegas', 'a', 'Carhartt', 'a', 'bathroom', 'older', 'older', 'a', 'a', 'a', '<eos>', 'a', 'right', 'older', 'a', 'older', 'a', 'older', 'a', 'projection', 'older', 'a', 'a', 'older', 'older', 'older', 'shoe', 'older', 'a', 'a', 'a', 'active', 'with', 'a', 'blacked', 'a', 'a', 'a', 'older', 'a', 'projection', 'older', 'older', 'older', 'older', 'a', 'a', 'a', 'a', 'older', 'a', 'a', 'with', 'older', '<eos>', 'a', 'older', 'older', 'a', '<eos>', 'a', 'Derby', 'older', 'a', 'older', 'a', 'a', 'older', 'with', 'with', 'a']
loss tensor(9.1648, device='cuda:0', grad_fn=<NllLossBackward0>)
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', '\n', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'has', 'a', 'a', '.', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', '.', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']

Thanks!

I actually think I got it working: I shrank the input vocabulary size and increased the learning rate, and now it’s producing sentence-like outputs during training.
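In case it’s useful to anyone else, the changes were along these lines. This is a sketch assuming torchtext-style vocab building; build_vocab_from_iterator, min_freq=5, the learning rate value, and the tokenize_en/train_targets names are illustrative, not my exact code:

import torch.optim as optim
from torchtext.vocab import build_vocab_from_iterator

# raising min_freq shrinks the vocabulary: rare words fall back to <unk>
trg_vocab = build_vocab_from_iterator(
    (tokenize_en(text) for text in train_targets),  # placeholder tokenizer/data
    min_freq=5,
    specials=["<unk>", "<pad>", "<bos>", "<eos>"],
)
trg_vocab.set_default_index(trg_vocab["<unk>"])

# a larger learning rate than before; model is the transformer from the notebook
optimizer = optim.Adam(model.parameters(), lr=0.0005)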

If the model explicitly defines its loss function as an attribute, you can directly check for it.

If the loss function is not a direct attribute of the model, it’s likely being passed or defined in the training loop. Search for where the loss is computed in the code.

If you’re using a framework or higher-level library that abstracts away the training process, like the Hugging Face Trainer, you can inspect (or override) its compute_loss method to see how the loss is computed.
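For example, with the Hugging Face Trainer you can subclass it and override compute_loss to watch the loss as it’s computed. A sketch (the print is just for inspection; recent versions may pass extra keyword arguments, hence **kwargs):

from transformers import Trainer

class InspectingTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # run the model and read the loss it reports
        outputs = model(**inputs)
        loss = outputs.loss
        print("loss:", loss.item())  # inspect each training step's loss
        return (loss, outputs) if return_outputs else loss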


I’m glad you were able to fix the problem! You’re welcome, and feel free to reach out if you have more questions. 🙌
