"Avoid using tokenizers before the fork"?

Running locally on M2 MacBook Air. Following along with the instructions in Lesson 3: Translation and Summarization.

What does this mean?

Avoid using tokenizers before the fork if possible

Not really sure what path to take following this warning:


The warning above comes from the Hugging Face fast tokenizers, whose internal parallelism is not fork-safe.

You can silence it by setting an environment variable to the string "false":
TOKENIZERS_PARALLELISM=false
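For example, you could set it at the very top of the notebook. The variable name is real; the placement is just a suggestion, since it must run before any tokenizer is used:

```python
import os

# Must be set before the first tokenizer import/call; otherwise the
# tokenizers library has already decided whether to use parallelism.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Equivalently, you can export it in the shell before launching the notebook server: `export TOKENIZERS_PARALLELISM=false`.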

Check the link below.

I am also tagging an NLP mentor in case he can add anything more. @arvyzukai, can you please have a look at this question? Thank you.

Regards
DP

Hi,

Allohvk's answer in the Stack Overflow link pretty much explains everything. I'm not familiar with the details of the code (as I understand it, the short course notebook), which might matter, but what comes to mind is that the tokenizer was probably called prior to training (you might have tried it to see what its outputs look like in your notebook; you wouldn't want that in "real" training).
In other words, you can ignore the warning while learning, but if you want the efficiency (for production) you should pay attention and adapt the code accordingly.

Cheers

Since I am learning all this for the first time, it’s confusing to get (what looks like) a serious warning when using the provided course notebook.

but what comes to mind is that you probably called the tokenizer prior to training

I only cut and pasted whatever was in the course notebook.

if you want the efficiency (for production) you should pay attention and adapt the code accordingly

Of course I’d prefer to write my own code without inefficiencies, but then it’s confusing that the course uses “bad practices” - if that’s what is happening.

Now that I'm further into the course, I realize that this course is only a survey, so I'm not stressing about the warnings. Perhaps there are other, more in-depth courses that explain how to do just one method, using "accurate code"? I'd appreciate your recommendations - thanks!


Again, I'm answering "blind" here (I don't see the notebooks), but generally there are no courses that dive deep into particular implementations (for example, a specific Hugging Face model implementation), because the landscape changes too fast: creating a course takes time, attracting learners takes time, and suddenly a new trending topic comes along and the course becomes irrelevant.
Usually, you would carefully read the Hugging Face documentation and check the source code for the methods you care about (especially when the documentation is not very helpful). Sometimes the specifics are framework-related; then posts like the one Deepti linked on Stack Overflow are the ones to look for. If that does not answer the question, you would dig deeper in the PyTorch forums.
As for developing "accurate code" (broadly speaking), you would look for language-related courses.
In other words, I wish I had a better answer, but this is what comes to mind.

Cheers