While AutoTokenizer is a convenient way to identify and load the right tokenizer for a pretrained model, there are cases, such as with models based on BERT, where using a specific tokenizer class is necessary. For instance, you might use DistilBertTokenizerFast as shown below:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
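Continuing from that snippet, the loaded tokenizer can be called directly on a string; this is just a minimal sketch with an arbitrary sample sentence:
encoding = tokenizer("Using a tokenizer is straightforward.")
print(encoding["input_ids"])       # WordPiece token ids, including [CLS] and [SEP]
print(encoding["attention_mask"])  # 1 for every real (non-padding) token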
So how do you know whether to use AutoTokenizer or a specific class? And how do you find the name of that specific class?
AutoTokenizer will return the correct tokenizer instance for the model you specify. The model page, such as the one for distilbert-base-uncased, will also show the name of the tokenizer class to use in its example code.
You can also use type(object) to learn which class the tokenizer instance actually is:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> type(tokenizer)
transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast
Here’s where you manually specify the tokenizer:
>>> from transformers import DistilBertTokenizerFast
>>> tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
>>> type(tokenizer)
transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast
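Since both paths resolve to the same DistilBertTokenizerFast class, they produce identical encodings. A quick sanity check (the sample sentence is arbitrary):
>>> from transformers import AutoTokenizer, DistilBertTokenizerFast
>>> auto_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> explicit_tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
>>> text = "Both tokenizers behave the same."
>>> auto_tok(text)["input_ids"] == explicit_tok(text)["input_ids"]
True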
See the Transformers documentation on using the *Fast tokenizers for more detail.
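One practical reason the distinction matters: some features, such as offset mappings, are only available on the fast (Rust-backed) tokenizers. A short sketch using the same checkpoint (the sample sentence is arbitrary):
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> tokenizer.is_fast   # True for Rust-backed *Fast tokenizers
True
>>> enc = tokenizer("fast tokenizers expose character offsets", return_offsets_mapping=True)
>>> enc["offset_mapping"][:2]   # (start, end) character spans; (0, 0) is the special [CLS] token
[(0, 0), (0, 4)]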