04_Data_preparation_lab_student - tokenize_function

Once I tokenize the dataset, shouldn't all the samples have the same length? However, in tokenize_function I noticed that, for each example passed in, max_length is set to the minimum of that example's tokenized length and 2048. So if we have two examples that tokenize to lengths 100 and 150, they will still have different lengths after the function processes them.

# from tokenize_function: max_length is computed per example
max_length = min(
    tokenized_inputs["input_ids"].shape[1],
    2048
)
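To make the behaviour concrete, here is a minimal sketch of what I understand the function to be doing. The checkpoint name EleutherAI/pythia-70m and the helper tokenize_one are my own placeholders for illustration, not necessarily what the lab uses:

from transformers import AutoTokenizer

# placeholder checkpoint, chosen only so the sketch runs
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token by default

def tokenize_one(text):
    # First pass: tokenize with no cap, just to measure this example
    tokenized_inputs = tokenizer(text, return_tensors="np", padding=True)
    # Per-example cap: this example's own length, or 2048 if it is longer
    max_length = min(tokenized_inputs["input_ids"].shape[1], 2048)
    # Second pass: truncate to that per-example max_length
    return tokenizer(text, return_tensors="np", truncation=True, max_length=max_length)

short_out = tokenize_one("a short example")
long_out = tokenize_one("a much longer example, repeated " * 20)
print(short_out["input_ids"].shape)  # (1, N_short)
print(long_out["input_ids"].shape)   # (1, N_long) -- the two lengths differ

The two printed shapes differ, which is exactly the behaviour described above.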

Please help.
Thanks,
Arindam Dey