Question about Supervised Fine-tuning in Module 3 Lab

Hi everyone,

I’m working on the fine-tuning assignment, and I noticed something in the tokenize_and_format function that seems incorrect for Instruction Tuning.

In the current code, we perform the following operation:

model_inputs['labels'] = model_inputs['input_ids'].copy()

By copying input_ids directly to labels without modification, we are calculating loss on the entire sequence (both the Question and the Answer). For Instruction Tuning, shouldn’t we be masking the user’s prompt?

As written, the model is being trained to predict both the Question and the Answer (standard Causal Language Modeling). Instead, it should see the question as context but only be trained to generate the answer, i.e. the loss on the prompt tokens should be masked out.
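To make the point concrete, here is a minimal sketch of the kind of prompt masking I mean (the token IDs and the build_labels helper are made up for illustration, not taken from the lab; -100 is the index that PyTorch’s cross-entropy loss ignores):

```python
# Illustrative sketch of prompt-loss masking for instruction tuning.
# -100 is the default ignore_index of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100

def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer; mask the prompt positions in labels
    so the loss is computed only on the answer tokens."""
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels

# Example with made-up token IDs: the prompt fills the first three positions.
input_ids, labels = build_labels([101, 7592, 102], [2023, 2003, 102])
print(labels)  # [-100, -100, -100, 2023, 2003, 102]
```

With labels built this way, the model still attends to the full prompt as context, but only the answer positions contribute to the training loss.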

Has anyone else encountered this, or can we confirm if this lack of masking is intended?

Hi @ArmaganEr,

model_inputs['labels'] = model_inputs['input_ids'].copy()

This code is typically used within a data processing function to create a separate copy of the input_ids and assign it to the labels key in the input dictionary. This is specifically done when the model is being trained to predict the input sequence itself (or a variation of it) as the target output.

model_inputs: This likely refers to a dictionary or mapping that holds all the necessary inputs for the model (input_ids, attention_mask, etc.).

input_ids: These are the numerical representations (indices) of the tokens in the input text after tokenization.

.copy(): This creates a shallow copy of the input_ids list, which is sufficient here because the token IDs are plain integers. The copy is crucial because modifying the labels later (e.g., setting some tokens to a special “ignore” index for padding) should not affect the original input_ids.

'labels': This is the key under which the target values the model is expected to learn are stored.

So after this line executes, the labels mirror the input_ids at that point in the preprocessing pipeline.

The model is then trained to reproduce this exact sequence of tokens.

Subsequent processing within the function would then likely refine these labels, for example by setting special tokens (like padding or prompt tokens) to the ignore index so they are excluded from the loss calculation.
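As a sketch of what such a refinement step might look like (the pad token ID and the mask_padding helper are hypothetical, not taken from the lab code):

```python
# Hypothetical refinement step: after copying input_ids to labels,
# replace padding positions with -100 so the loss ignores them.
PAD_TOKEN_ID = 0      # assumed pad token id, for illustration only
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(model_inputs):
    """Copy input_ids to labels, then mask padding positions in the labels."""
    labels = model_inputs["input_ids"].copy()
    labels = [IGNORE_INDEX if tok == PAD_TOKEN_ID else tok for tok in labels]
    model_inputs["labels"] = labels
    return model_inputs

batch = {"input_ids": [7592, 2023, 102, 0, 0]}
print(mask_padding(batch)["labels"])  # [7592, 2023, 102, -100, -100]
```

Note that this only masks padding; masking the prompt tokens, as asked above, would require an analogous step over the prompt positions.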

Thanks for the explanation. My question is rather about which function and line in the assignment actually mask the prompt tokens. Unless the SFT Trainer that the assignment uses automatically ignores prompt tokens, I don’t see a processing step anywhere in the assignment that masks them.

Let me go through the lab code completely and get back to you.

The code you had doubts about is typically used in Hugging Face dataset tokenization.

Can you confirm the lab name again, please?

It’s module 3 lab named Evaluation and Debugging Lab. Thanks.

Refer to the utils.py file; you will notice this part of the code.

Also, instead of masking the prompt tokens, the attention mask is used here. Setting tokenizer.pad_token = tokenizer.eos_token primarily enables efficient batch processing by providing a padding token ID that is already part of the model’s existing vocabulary.

How this is achieved

  1. The models process inputs in batches and require all sequences to be of uniform length. Shorter sequences are filled with padding tokens to match the length of the longest sequence.

  2. Setting the pad token ID to the End-of-Sequence (EOS) token ID means the model can reuse its existing, trained embedding for the EOS token as padding, avoiding the need to add and randomly initialize a new token (which would require further training).

  3. The tokenizer, when applying padding, also generates an attention mask. This mask is a binary tensor that explicitly tells the model which tokens are actual content (1) and which are padding (0).

  4. During the model’s forward pass (specifically in the self-attention mechanism), the attention mask is applied: tokens with a mask value of 0 (the padded positions, which now carry the EOS token ID) are ignored and do not contribute to the attention calculations. Excluding padding from the loss is handled separately, typically by setting those label positions to the ignore index (-100).
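The padding and attention-mask behaviour described in the steps above can be sketched in plain Python (the EOS ID and the pad_batch helper are illustrative, not the tokenizer’s actual implementation):

```python
# Sketch of batch padding where the EOS token doubles as the pad token
# (names and IDs are illustrative, not from the lab code).
EOS_ID = 2  # assumed EOS/pad token id

def pad_batch(sequences):
    """Pad all sequences to the longest length and build attention masks."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        input_ids.append(s + [EOS_ID] * n_pad)             # fill with EOS-as-pad
        attention_mask.append([1] * len(s) + [0] * n_pad)  # 1 = content, 0 = pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[5, 6, 7], [5, 6]])
print(batch["input_ids"])       # [[5, 6, 7], [5, 6, 2]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

This is essentially what the Hugging Face tokenizer does internally when you tokenize with padding enabled.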

Remember, as I mentioned earlier in this response, model_inputs already has the input_ids and attention_mask, which are used in model training.

If you notice, the labs use AutoTokenizer from the transformers library, and when you call the tokenizer on text (e.g., tokenizer(text)), it returns a dictionary-like object (a BatchEncoding) containing these input_ids and the attention_mask.

So when you are generating a Q&A response or doing text generation, the tokenizer.pad_token (which is assigned to tokenizer.eos_token) is used to pad the batch, and the attention mask (a tensor of 1s and 0s marking which tokens are actual content (1) and which are padding (0)) tells the model which positions to attend to.