In the 04_Data_preparation_lab_student notebook, we have the following code:
def tokenize_function(examples):
    ...
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs
Why is the tokenizer being called twice? Why can't we use the following code instead, with a single call to the tokenizer?
tokenizer.pad_token = tokenizer.eos_token
tokenizer.truncation_side = "left"
max_length = 2048
# input with > 2048 tokens will be truncated on the left side to 2048 tokens
# input with < 2048 tokens will remain unchanged
# even padding=True is not needed, if this is not being batched while mapping over the dataset
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)