Is the tokenizer a model?

Hello everyone,

I notice that the tokenizer returns something that has “input_ids” and “attention_mask.”

  • If I understand things correctly, input[“input_ids”] is then the prompt in a vector space, which we use as input to the model.

  • Is the “attention_mask” the same attention mask that the class talked about while discussing the “Attention Is All You Need” paper? So in that sense, is the tokenizer a model too?

I’m looking at the documentation of AutoTokenizer linked in the notebook, but I don’t understand how the tokenizer figures out the “input_ids” or what the “attention_mask” is.

Thank You,
Bobi

Hi @bobi,

I think you are looking at the Hugging Face Transformers library, and specifically at the AutoTokenizer class.

When you use the Hugging Face utilities for tokenization, they go a step beyond tokenization in the strict sense. As you mention, they already return the input_ids and attention_mask expected by the specific model you are instantiating. In this case HF is not only splitting the input text into tokens but also mapping each token to its integer ID in that model’s vocabulary (the input_ids) and, depending on the specifics of the particular model, returning the appropriate attention_mask (which marks which positions hold real tokens as opposed to padding).
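To make that concrete, here is a minimal sketch of calling an AutoTokenizer; it assumes the transformers library is installed, and "bert-base-uncased" is just an example checkpoint, not something specific to your notebook:

```python
from transformers import AutoTokenizer

# Load the tokenizer that matches a particular model checkpoint
# ("bert-base-uncased" is only an example here).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = ["Is the tokenizer a model?", "Hello!"]
encoded = tokenizer(batch, padding=True)

print(encoded["input_ids"])       # integer vocabulary IDs, one list per sentence
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding positions
```

Note that the input_ids are plain integers looked up from the model’s vocabulary; the embedding into a vector space happens later, inside the model itself.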

Now, if we talk strictly about tokenization, not in the HF context but as a pure concept, the input would be text and the output would also be text, split into pieces or tokens. This is usually done by a rule-based algorithm, also specific to the model, but you can also find other tokenization approaches, such as statistical or neural tokenization.
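As a toy illustration of the rule-based idea (this is a hypothetical helper, far simpler than the subword tokenizers real models use):

```python
import re

def simple_rule_based_tokenize(text):
    # Toy rule: keep runs of word characters together and treat
    # each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_rule_based_tokenize("Is the tokenizer a model?"))
# -> ['Is', 'the', 'tokenizer', 'a', 'model', '?']
```

So the tokenizer itself is not a model in the neural-network sense; it is a preprocessing step that prepares the inputs the model expects.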