Training Process lesson - Why tokenize twice?

I have a question about the code in this lesson and would appreciate it if someone could clarify. In this lesson we use tokenize_and_split_data from the utilities module to create a tokenized train_dataset and test_dataset. We then pass this data to the inference function defined in the notebook, which applies the "EleutherAI/pythia-70m" tokenizer to data that has already been tokenized. Why is this second tokenization needed? Is it correct?
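
To make the question concrete, here is a minimal sketch of the pattern as I understand it. Everything in it is a toy stand-in: toy_encode, toy_tokenize_dataset and toy_inference are hypothetical names, not the lesson's functions and not the real Pythia tokenizer. It assumes, without my having confirmed it in the lesson code, that the tokenized dataset still carries the raw text columns and that the inference function is given a raw string. If that assumption holds, the second tokenizer call would be encoding text rather than re-tokenizing token ids.

```python
# Toy stand-ins only: not the lesson's utilities.py and not the real tokenizer.
VOCAB = {}


def toy_encode(text):
    """Map each whitespace-separated word to an integer id (hypothetical tokenizer)."""
    return [VOCAB.setdefault(word, len(VOCAB)) for word in text.lower().split()]


def toy_tokenize_dataset(rows):
    """Dataset prep: add token ids for training while keeping the raw text columns."""
    return [
        {**row, "input_ids": toy_encode(row["question"] + " " + row["answer"])}
        for row in rows
    ]


def toy_inference(text):
    """Inference: expects a raw string and tokenizes it itself."""
    if not isinstance(text, str):
        raise TypeError("inference expects raw text, not token ids")
    return toy_encode(text)


raw_rows = [
    {"question": "What is fine-tuning?", "answer": "Further training on task data."}
]
train_dataset = toy_tokenize_dataset(raw_rows)

# First tokenization: stored in the dataset for the training loop.
training_ids = train_dataset[0]["input_ids"]

# Second tokenization: applied to the raw question string, not to training_ids.
prompt_ids = toy_inference(train_dataset[0]["question"])
```

In this sketch the two tokenizer calls serve different consumers: the stored input_ids feed training, while the inference helper encodes whatever raw prompt it is handed. Whether the lesson's code works the same way is exactly what I am asking.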

Yes, I also feel that this ends up requiring more memory!

From what I understood, they use GPT and then ChatGPT to create a fine-tuned LLM. I would also like to know the answer to the question you have raised. I believe this takes up a large amount of memory (gigabytes), which is not what I would call fine-tuning.

Good question. Also, I don’t know how to install the utilities package locally. Do you know how it works? pip install utilities does not work. Is there another way to access the tokenize_and_split_data function? Any help is much appreciated!

If you open the notebook, you will find the utilities.py file there and can download it; see the attached image.
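
Once downloaded, utilities.py is just a local Python file, not a package from PyPI, which is why pip install utilities does not give you the lesson's code. The simplest option is usually to save it next to your notebook or script and write from utilities import tokenize_and_split_data. If you keep it somewhere else, a minimal sketch of loading it by path looks like the following; load_local_module is a hypothetical helper name I made up, and the file location in the usage comment is an assumption.

```python
import importlib.util
from pathlib import Path


def load_local_module(path, name="utilities"):
    """Import a single .py file by its path and return the module object."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; download it from the notebook first")
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module


# Usage, assuming utilities.py was saved into the current working directory:
# utilities = load_local_module("utilities.py")
# tokenize_and_split_data = utilities.tokenize_and_split_data
```

Loading the file runs its own imports, so any third-party packages it depends on still need to be installed in your environment.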

Super, thanks!!

I guess you got the solution to your problem. If not, please let me know.