In the 04_Data_preparation_lab_student notebook, we have the following code:
def tokenize_function(examples):
    ...
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs
Why is the tokenizer being called twice? Why can't we use the following code instead, with a single call to the tokenizer?
tokenizer.pad_token = tokenizer.eos_token
tokenizer.truncation_side = "left"
max_length = 2048
# input with > 2048 tokens will be truncated on the left side to 2048 tokens
# input with < 2048 tokens will remain unchanged
# even padding=True is not needed, if this is not being batched while mapping over the dataset
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)