Module 2 Graded Lab 1: padding and EOS tokens

Questions: 1) Why is padding added at the beginning of the sequence rather than the end? Would it make a difference in performance? 2) Is there an end-of-sequence token? If not, why not? 3) In other SFT tutorials, I have seen the use of "chat templates" which show the model the roles and appropriate responses. I don't think I saw a chat template being defined in this lab.

hi @aezazi

Padding goes at the beginning (pre) or end (post) mostly by efficiency and convention in Transformer models (like BERT/GPT). The choice affects performance only slightly, by changing what information gets masked, and modern models handle both. And yes, the end-of-sequence (EOS) token is crucial for generation: it signals completion, prevents infinite output, and helps the model learn sequence boundaries, unlike padding tokens, which just fill space.

Why it's done: Models need fixed-size inputs, so shorter sequences get padding tokens (such as 0 or a dedicated pad token) to match the longest in the batch.

Beginning (Pre-padding): Common for decoder-only models (GPT-style) at generation time, because the model continues from the last position; left padding keeps the actual content ending exactly where generation starts.

End (Post-padding): Simpler and the usual default for encoder models (BERT-style) and RNNs/LSTMs, where the attention mask (or recorded sequence lengths) tells the model which positions are real.

Pre-padding can subtly interact with how positional encodings work, but models are designed to handle both. The main performance impact comes from how much padding is needed (long sequences = more padding = less efficient batching).
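As a concrete illustration (a minimal plain-Python sketch with made-up token IDs; real tokenizers do this for you via `padding_side`), the two conventions look like this:

```python
def pad_batch(sequences, pad_id=0, side="post"):
    """Pad a batch of token-ID lists to the longest length in the batch.

    side="post" appends pad_id (post-padding); side="pre" prepends it
    (pre-padding). Returns the padded sequences plus matching attention
    masks (1 = real token, 0 = padding).
    """
    max_len = max(len(s) for s in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if side == "post":
            padded.append(seq + [pad_id] * n_pad)
            masks.append([1] * len(seq) + [0] * n_pad)
        else:  # "pre"
            padded.append([pad_id] * n_pad + seq)
            masks.append([0] * n_pad + [1] * len(seq))
    return padded, masks


batch = [[101, 102], [101, 102, 103, 104]]
post, post_mask = pad_batch(batch, side="post")
pre, pre_mask = pad_batch(batch, side="pre")
print(post)  # [[101, 102, 0, 0], [101, 102, 103, 104]]
print(pre)   # [[0, 0, 101, 102], [101, 102, 103, 104]]
```

Either way, the mask carries the same information; what changes is where the real content sits relative to the last position, which is why left padding is preferred for generation.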

End-of-Sequence Token (EOS)

Yes, it exists and is vital: an EOS token marks the true end of meaningful text.

Significance:

Generation: Tells text-generating models (like GPT) when to stop generating words, preventing endless gibberish.

Encoding: Signals to the encoder that the input sequence is complete for processing.

Crucial Distinction: It's not the same as a padding token; EOS carries actual learned information, whereas padding is ignored (masked).

Also, a precaution about setting pad_token = eos_token: it is sometimes done for simplicity, but it can confuse models during training, since padding tokens aren't meant to be meaningful outputs the way EOS tokens are.

Hope this helps!

Regards

DP


This is a great opportunity for you to do some experiments.
Please report back your results.


Thanks so much for the thorough response. A few follow-up questions and comments.

As I mentioned in my initial question, I followed another SFT tutorial using the "mistralai/Mistral-7B-v0.1" model and the "HuggingFaceH4/ultrachat_200k" dataset. In the tutorial, they use post-padding, which is what I did. Something that did not make sense was that they used the EOS token for padding, so I created a special padding token and added it to the tokenizer. Also, the dataset chat template messages did not include a "system" role, so I added that to the chat template along with some modifications to create a proper chat template from the dataset, and adjusted the model embedding size accordingly.

Unfortunately, I don't think my model is training properly: very little (although consistent) drop in validation loss. I think my chat template and tokenizer implementation is correct, but I'm not sure. The other issue I faced was the SFTConfig() parameters. There seem to be dozens of parameters, which I find quite confusing at this stage; perhaps that's the source of the problem. These parameters can also conflict with LoraConfig(), which is another source of confusion for me. Anyway, perhaps the rest of the course will clarify some of these.

I am attaching my code for creating the custom tokenizer/chat template and trainer in case you or anyone else has the time to take a look. @TMosh

custom_tok’er_dataset_v3.py (10.4 KB)

sft_train_v3.py (11.7 KB)

hi @aezazi

Can you post a screenshot of where you mentioned they used the end-of-sequence token for padding, and of what confusion it created?

Padding adds special tokens (like zeros) to the end (or beginning) of shorter sequences in a batch so all sequences have the same length. This makes them processable by deep learning models (RNNs, Transformers) that need fixed-size inputs, ensures efficient batch processing, and retains all the original data; a mask tells the model to ignore the padded parts during calculations.

When Padding is Used:

  1. Batch Processing: Models train on batches of data. Padding standardizes sequence lengths within a batch for tensor operations.
  2. Fixed-Length Inputs: Many models, especially Transformers, require uniform input dimensions.
  3. Efficiency: Avoids truncating longer sequences, preserving all original information.

How it works:

Set Max Length: A maximum sequence length L is chosen (often the longest in the batch or a predefined value).

Add Padding Tokens: For sequences shorter than L, padding tokens (e.g., 0) are added to the end. Example: "Hello world" (2 tokens) becomes [101, 102, 0, 0] if padded to length 4.

Masking: A corresponding mask is created (e.g., [1, 1, 0, 0]) to tell the model to ignore the 0s during loss calculation and attention, preventing them from affecting the output.
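The masking step above can be sketched in plain Python (illustrative per-token loss values, not a real model; in a real training loop the framework applies the mask inside attention and the loss):

```python
def masked_mean_loss(per_token_losses, attention_mask):
    """Average per-token losses over real positions only.

    Positions where the mask is 0 (padding) contribute nothing to the
    sum and are excluded from the count, so whatever values sit at the
    padded positions cannot affect the result.
    """
    total = sum(l * m for l, m in zip(per_token_losses, attention_mask))
    count = sum(attention_mask)
    return total / count


# "Hello world" padded to length 4 has mask [1, 1, 0, 0].
losses = [0.5, 1.5, 9.0, 9.0]  # garbage values at the padded positions
print(masked_mean_loss(losses, [1, 1, 0, 0]))  # 1.0 — pads ignored
```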

Regards

DP

@aezazi

I tried opening your SFT notebook and it showed an error stating an incorrect token ## is being used, and then the notebook doesn't open.

Can I confirm whether the special token you mentioned adding to the sequence is #?

If you can share a screenshot of the code from your notebook, or the training results, I can probably have a look at why the results are not good.

I also didn't understand what you meant by adding a system role to the chat template.

are you creating a RAG architecture?

Also, just to be sure: I noticed your category shifted from AI discussion to a particular course. If your notebook is a graded assignment, I highly suggest you delete it from the public post, as that would be against community guidelines.

Hi DP,
Apologies for the late reply. I just got back to working on the course again.

First, regarding the code I posted: it's based on a completely different tutorial on SFT and is unrelated to the lab, other than also being an SFT implementation.

Going back to my original question, I understand the purpose of padding and padding masks. To clarify: the mistralai/Mistral-7B-v0.1 model tokenizer does not have a dedicated special token for padding, so the tutorial was using the EOS token as the padding token. This did not make sense to me, so I created a dedicated padding special token. I also modified the tutorial chat template, since it did not include a system role, and created a few other special tokens accordingly. Here is the code snippet for all of this, along with a test to demonstrate the chat template and tokens:
#%%

# add padding special token to the model tokenizer to facilitate chat template

"""

In order to perform SFT, we need to create a “chat template”. A chat template basically adds special tokens to the conversations in our dataset to help the model learn a chat conversational structure.

After some research, I decided to use a template structure suggested by Claude. I also decided to use a dedicated pad token instead of using the eos token for padding, as is sometimes done. I find these tokens to be a lot more readable and easy to follow when testing and debugging. Note, however, that there are many approaches to creating chat templates. The main takeaways from my research were to be consistent and to avoid designs that might confuse the model as to the purpose of a token. This is why I decided to use a dedicated pad token rather than the eos token: I was never able to understand how models that reuse the eos token for padding avoid confusing a legitimate eos with padding.

Here is an explanation of why the special tokens are created with the pad token getting its own individual key, while the other custom tokens are placed in a list under the key "additional_special_tokens".

The Two Categories of Special Tokens

1. Standard Special Tokens (dedicated keys)

These have predefined roles across all Hugging Face tokenizers (although a given tokenizer may use only a subset):

bos_token - Beginning of sequence (e.g., `<s>`)

eos_token - End of sequence (e.g., `</s>`)

pad_token - Padding token

unk_token - Unknown token

sep_token - Separator token (used in some models like BERT)

cls_token - Classification token (used in some models like BERT)

mask_token - Mask token (for masked language modeling)

These have specific behaviors built into the tokenizer. For example:

pad_token is automatically used when you pad sequences

eos_token might be used to signal when generation should stop

2. Additional Special Tokens (list)

These are custom tokens you want to add that don’t fit the predefined roles:

additional_special_tokens - A list of any custom special tokens you want

These tokens are treated as special (won’t be split during tokenization) but don’t have automatic behavior.

Why pad_token gets its own key:

The tokenizer needs to know: “When I pad, use THIS token”

When you call tokenizer.pad(), it automatically uses tokenizer.pad_token

It has functional significance beyond just being “special”

Why the others go in additional_special_tokens:

They mark structure in your chat format

But the tokenizer doesn’t need to automatically use them for anything

You manually insert them via your chat template

The key insight: dedicated keys give tokens automatic behavior, additional_special_tokens just marks them as “don’t split these during tokenization”. For chat formatting, you usually want full manual control, so additional_special_tokens is the right choice for <|im_start|> and <|im_end|>.

FYI:

These attributes exist on ALL HuggingFace tokenizers

tokenizer.bos_token

tokenizer.eos_token

tokenizer.pad_token

tokenizer.unk_token

tokenizer.sep_token

tokenizer.cls_token

tokenizer.mask_token

# And their IDs

tokenizer.bos_token_id

tokenizer.eos_token_id

# etc.

"""

# Add special tokens for chat SFT

special_tokens_dict = {
    "pad_token": "<|pad|>",
    "additional_special_tokens": ["<|user|>", "<|assistant|>", "<|system|>"],
}

tokenizer.add_special_tokens(special_tokens_dict)

tokenizer.pad_token = "<|pad|>"

tokenizer.padding_side = "right"

model.resize_token_embeddings(len(tokenizer))

model.config.pad_token_id = tokenizer.convert_tokens_to_ids("<|pad|>")

# inspect special tokens

print(f"Padding token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")

print(f"Special tokens: {tokenizer.additional_special_tokens}")

print(f"Tokenizer vocabulary size: {len(tokenizer)}")

print(f"Special tokens added: {tokenizer.all_special_tokens}")

#%%

# ============ Sanity check: tokenizer and model alignment ====================

print("Tokenizer vocab size:", len(tokenizer))

print("Model embedding size:", model.get_input_embeddings().weight.shape[0])

# Verify special tokens are recognized

for tok in ["<|pad|>", "<|user|>", "<|assistant|>", "<|system|>"]:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    print(f"{tok}: ID={tok_id}")
    assert tok_id < model.get_input_embeddings().weight.shape[0], "Token ID out of range!"

print("Model and tokenizer are fully aligned!")

#%%

# Define ChatML template with non-assistant masking

chat_template = (
    "{% for message in messages %}\n"
    "{% if message['role'] == 'user' %}\n"
    "{{ '<|user|>\n' + message['content'] + eos_token }}\n"
    "{% elif message['role'] == 'system' %}\n"
    "{{ '<|system|>\n' + message['content'] + eos_token }}\n"
    "{% elif message['role'] == 'assistant' %}\n"
    "{{ '<|assistant|>\n' + message['content'] + eos_token }}\n"
    "{% endif %}\n"
    "{% if loop.last and add_generation_prompt %}\n"
    "{{ '<|assistant|>' }}\n"
    "{% endif %}\n"
    "{% endfor %}"
)

tokenizer.chat_template = chat_template

#%%

# Create function apply chat template to datasets

def formatting_func(example, tokenizer):
    messages = example["messages"]

    # Ensure a system message exists
    if len(messages) == 0 or messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})

    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example

#%%

# test the formatting function

# Mock conversation

test_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Hi, can you tell me a joke?"},
        {"role": "assistant", "content": "Sure! Why did the math book look sad? Because it had too many problems."},
        {"role": "user", "content": "Haha, another one please!"},
        {"role": "assistant", "content": "What do you call a fake noodle? An impasta!"},
    ]
}

# apply formatting function to example

formatted = formatting_func(test_example, tokenizer)

print(formatted)

What I was trying to ask was whether it is correct to structure a chat template this way and modify the tokenizer and model accordingly. I used right padding instead of left padding, which might also be an issue. In general, when doing SFT or RLHF, what sort of loss curve should we look for? One explanation for why I did not see much improvement in the training loss, whereas the validation loss was improving slowly, might be that the base model is already pretty well trained. Thanks for your help.

@aezazi

Kindly DM me the link to the lab where you ran this code, so I can check for myself what the issue could be.

Also, just to confirm: the thing you added to the original code is an additional special token, and that's it, right?

I need this information to compare your lab code with your updated code.

Hi DP,
Just to try and clarify my original question again: I finished the M2 graded lab 2. The following code from the "Load the Tokenizer" cell is the crux of my original question:

# Set up padding token

# Padding is used to make all inputs the same length

if tokenizer.pad_token is None:
    # If no padding token is defined, use the end-of-sequence token
    tokenizer.pad_token = tokenizer.eos_token

# Set padding to the left side for generation tasks
# This ensures the actual text is on the right (where the model expects it)
tokenizer.padding_side = "left"

If the tokenizer does not have a dedicated padding token, why is it OK to just use the eos_token for padding? Does this not confuse the model as to what is a true EOS vs. padding? Why not create a dedicated padding token?

Please DM me the original link to the code; you can share it by personal DM if privacy is a concern.

I understood your question, but I need to check the complete code and compare it with your updated code to be able to answer in a more meaningful way, since you are already confused about using the end-of-sequence token for padding.

The original code

tokenizer.pad_token = tokenizer.eos_token

sets the padding token (used for filling shorter sequences in a batch) to be the same as the end-of-sequence (EOS) token, simplifying the model by using one token for both tasks (processing and generation of text) and reducing complexity.

A common problem with this method is that the model can prematurely stop generating text (treating the padding as the actual end of content), or include unwanted padding in the output, unless the model is trained to distinguish the first EOS as the content end and subsequent ones as padding.
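The training-side risk can be sketched in plain Python (a toy example with hypothetical token IDs, following the common convention of masking pad positions with -100 in the labels so the loss ignores them): if pad and EOS share an ID, and every occurrence of that ID is masked, the genuine EOS gets masked too, so the model is never trained to emit it.

```python
def build_labels(input_ids, pad_token_id):
    """Naively mask every pad position with -100 (ignored by the loss)."""
    return [-100 if tok == pad_token_id else tok for tok in input_ids]


EOS = 2  # hypothetical ID; here padding reuses it, as in the lab's setup
seq = [5, 6, 7, EOS, EOS, EOS]  # real EOS followed by two pads

print(build_labels(seq, pad_token_id=EOS))
# [5, 6, 7, -100, -100, -100] — the true EOS is masked out as well

# With a dedicated pad token, the real EOS survives in the labels:
PAD = 3  # hypothetical dedicated pad ID
seq2 = [5, 6, 7, EOS, PAD, PAD]
print(build_labels(seq2, pad_token_id=PAD))
# [5, 6, 7, 2, -100, -100]
```

This is only a sketch of the failure mode; whether it bites in practice depends on how the data collator builds the labels, which is why a dedicated pad token is the safer default.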

I have actually experienced this personally with ChatGPT: a query it was still answering ended prematurely with a message stating my session had expired (this happened especially when I was cross-questioning ChatGPT's responses).

I still don't know which part of the code you changed, particularly as you mentioned adding a system role.

But I noticed one thing in your previous response: the roles are a bit confusing once you add system. What is the significance of assistant in the chat template then? It is as if you added more noise or confusion to your template, since assistant is already acting as the system AI assistant, and adding a system role probably hurt the model's accuracy.

If you need proper reasoning, I need to go through the complete code (you can send it to me by DM, but kindly send the link to the notebook you are working on, or send the original notebook first and then the updated one for better comparison).

Good luck! Feel free to ask if you have any doubts.

regards

DP

Thanks DP, this was helpful. I'll clean up my code and DM you.


I was going through your SFT tutorial ipynb, and I noticed this at the beginning:

So the role of system is already being handled by the assistant, as I mentioned in my earlier response, and adding system to this must have caused the two roles to act as two different tools, like an agent.

You also have a TypeError?

Hi DP,
Thanks for the reply. The code you are referring to is from the tutorial. As I mentioned in my previous comments, I could not get it to run because a number of the SFTConfig parameters it uses have been deprecated by Hugging Face. As I noted in my DM, my code, which runs and ostensibly trains, is in the two files ending with .py; I sent you the tutorial, which ends with .ipynb, just as a reference. In any case, thanks so much for your replies. I don't want to take up any more of your time on this.


good luck @aezazi

Just to let you know, the .py file still doesn't open. Hope your issue is resolved.