While AutoTokenizer is a convenient way to identify and load the right tokenizer for a pretrained model, there are cases, such as with models based on BERT, where using a specific tokenizer class is necessary. For instance, you might use DistilBertTokenizerFast as shown below:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
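Continuing from that snippet, the loaded tokenizer can be called directly on a string; this is just a minimal sketch with an arbitrary sample sentence:
encoding = tokenizer("Using a tokenizer is straightforward.")
print(encoding["input_ids"])       # WordPiece token ids, including [CLS] and [SEP]
print(encoding["attention_mask"])  # 1 for every real (non-padding) token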
So how do you know whether to use AutoTokenizer or a specific class? And how do you find the name of that specific class?
AutoTokenizer will return the correct tokenizer instance for the model you specify. The model page, such as the one for distilbert-base-uncased, will also show the name of the tokenizer class to use in its example code.
You can also use type(object) to learn which class the tokenizer instance actually is:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> type(tokenizer)
transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast
Here’s where you manually specify the tokenizer:
>>> from transformers import DistilBertTokenizerFast
>>> tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
>>> type(tokenizer)
transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast
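Since both paths resolve to the same DistilBertTokenizerFast class, they produce identical encodings. A quick sanity check (the sample sentence is arbitrary):
>>> from transformers import AutoTokenizer, DistilBertTokenizerFast
>>> auto_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> explicit_tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
>>> text = "Both tokenizers behave the same."
>>> auto_tok(text)["input_ids"] == explicit_tok(text)["input_ids"]
True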
See the Transformers documentation on using the *Fast tokenizers for more detail.
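One practical reason the distinction matters: some features, such as offset mappings, are only available on the fast (Rust-backed) tokenizers. A short sketch using the same checkpoint (the sample sentence is arbitrary):
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> tokenizer.is_fast   # True for Rust-backed *Fast tokenizers
True
>>> enc = tokenizer("fast tokenizers expose character offsets", return_offsets_mapping=True)
>>> enc["offset_mapping"][:2]   # (start, end) character spans; (0, 0) is the special [CLS] token
[(0, 0), (0, 4)]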