"Avoid using tokenizers before the fork"?

Running locally on M2 MacBook Air. Following along with the instructions in Lesson 3: Translation and Summarization.

What does this mean?

Avoid using tokenizers before the fork if possible

Not really sure what path to take following this warning:


The warning above comes from the Hugging Face fast tokenizers, whose internal parallelism is not fork-safe.

You can silence it by setting an environment variable to the string "false":
TOKENIZERS_PARALLELISM=false
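For example, you could set it at the very top of the notebook. The variable name is real; the placement is just a suggestion, since it must run before any tokenizer is used:

```python
import os

# Must be set before the first tokenizer import/call; otherwise the
# tokenizers library has already decided whether to use parallelism.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Equivalently, you can export it in the shell before launching the notebook server: `export TOKENIZERS_PARALLELISM=false`.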

Check the link below.

I am also tagging an NLP mentor in case he can add anything more. @arvyzukai, can you please have a look at this question? Thank you.

Regards
DP

Hi,

Allohvk's answer in the Stack Overflow link pretty much explains everything. I'm not familiar with the details of the code (as I understand it, the short course notebook), which might matter, but what comes to mind is that the tokenizer was probably called prior to training (you might have tried it to see what its outputs look like in your notebook; you wouldn't want that in "real" training).
In other words, you can ignore the warning while learning, but if you want the efficiency (for production) you should pay attention and adapt the code accordingly.

Cheers

Since I am learning all this for the first time, it’s confusing to get (what looks like) a serious warning when using the provided course notebook.

but what comes to mind is that you probably called the tokenizer prior to training

I only cut and pasted whatever was in the course notebook.

if you want the efficiency (for production) you should pay attention and adapt the code accordingly

Of course I’d prefer to write my own code without inefficiencies, but then it’s confusing that the course uses “bad practices” - if that’s what is happening.

Now that I'm further into the course, I realize that this course is only a survey, so I'm not stressing about the warnings. Perhaps there are other, more in-depth courses that explain how to do just one method, using "accurate code"? I'd appreciate your recommendations - thanks!


Again, I'm answering "blind" here (I don't see the notebooks), but generally there are no courses that dive deep into particular implementations (for example, a specific Hugging Face model implementation), because the landscape changes too fast: creating a course takes time, attracting learners takes time, and suddenly a new trending topic comes along and the course becomes irrelevant.
Usually, you would carefully read the Hugging Face documentation and check the source code for the methods you care about (especially when the documentation is not very helpful). Sometimes the specifics are framework-related; then posts like the one Deepti linked on Stack Overflow are the ones to look for. If that does not answer the question, you would dig deeper in the PyTorch forums.
As for developing "accurate code" (broadly speaking), you would look for language-related courses.
In other words, I wish I had a better answer, but this is what comes to mind.

Cheers