Training Process lesson - Why Tokenize two times

vsrinivas · August 27, 2023, 12:43pm

I have a question about the code in this lesson. We use tokenize_and_split_data from the utilities package to create a tokenized train_dataset and test_dataset. We then pass this data to the inference function defined in the lesson, which applies the "EleutherAI/pythia-70m" tokenizer to data that is already tokenized. Why do we need to do this, and is it correct? I would appreciate it if someone could help me understand.

Deepti_Prasad · August 27, 2023, 3:51pm

Yes, I also feel this ends up requiring more memory!

As I understood it, they use GPT and then ChatGPT to create fine-tuned LLMs! I would also like to know the answer to the question you have raised. I believe this takes up a large amount of memory (GBs), which I would not call fine-tuning.

Lukas_Brull · August 27, 2023, 5:13pm

Good question. Also, I don’t know how to install the utilities package locally. Do you know how it works? pip install utilities does not work. Is there another way to access the tokenize_and_split_data function? Any help is much appreciated!

Adam_Hjerpe · August 27, 2023, 5:36pm

If you open the notebook, you will find the utilities.py file there, and you can download it (see the attached image).

Lukas_Brull · August 27, 2023, 5:40pm

Super, thanks!!

vsrinivas · August 28, 2023, 7:06am

I guess you got the solution to your problem. If not, please let me know.
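
On the original question of why the tokenizer is applied again inside the inference function, one possible explanation can be sketched as follows. This is only a toy illustration in plain Python, not the course's utilities.py or the real "EleutherAI/pythia-70m" tokenizer. The names toy_encode, toy_tokenize_dataset and toy_inference are hypothetical, and the sketch assumes that the tokenized dataset keeps its original text columns next to the token ids and that inference is given the raw text. Please check the notebook and utilities.py to confirm whether that assumption holds.

```python
# Toy stand-ins only: none of this is the course's utilities.py or a real tokenizer.

VOCAB = {}  # word -> integer id, filled in as new words are seen


def toy_encode(text):
    """Map each whitespace-separated word to an integer id (hypothetical tokenizer)."""
    return [VOCAB.setdefault(word, len(VOCAB)) for word in text.split()]


def toy_tokenize_dataset(rows):
    """Add an input_ids column while keeping the text columns (assumed behaviour)."""
    return [
        {**row, "input_ids": toy_encode(row["question"] + " " + row["answer"])}
        for row in rows
    ]


def toy_inference(text):
    """Inference starts from a raw string, so it has to encode that string itself."""
    if not isinstance(text, str):
        raise TypeError("expected raw text, not token ids")
    return toy_encode(text)


rows = [{"question": "What is fine-tuning ?", "answer": "Further training ."}]
dataset = toy_tokenize_dataset(rows)

# The text column is still present, so the raw question can be passed to inference.
prompt_ids = toy_inference(dataset[0]["question"])
```

Under that assumption nothing is tokenized twice for the same purpose: the input_ids column would be used for training, while inference encodes only the raw prompt string it receives.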
