Is the tokenizer a model?

Hello everyone,

I notice that the tokenizer returns something that has “input_ids” and “attention_mask.”

  • If I understand things correctly, input[“input_ids”] is then the prompt in a vector space, which we use as the input to the model.

  • Is the “attention_mask” the same attention mask which the class talked about while discussing the “Attention Is All You Need” paper? So in that sense, is the tokenizer a model too?

I’m looking at the documentation of AutoTokenizer linked in the notebook, but I don’t understand how the tokenizer figures out the “input_ids” or what the “attention_mask” is.

Thank You,

Hi @bobi ,

I think you are looking at the Hugging Face Transformers library, and specifically at AutoTokenizer.

When you use the HF utilities for tokenization, they go a step beyond strict tokenization. As you mention, they already return the input_ids and attention_mask appropriate for the specific model you are instantiating. One correction, though: the tokenizer does not compute embeddings. The input_ids are integer indices into the model’s vocabulary, and the mapping into a vector space happens inside the model itself, in its embedding layer. The attention_mask marks which positions hold real tokens (1) and which are padding (0), so the model can ignore the padding. It feeds into the attention mechanism from “Attention Is All You Need,” but it is produced by simple bookkeeping, not by a model.
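To make the output format concrete, here is a toy sketch (not the real HF implementation) of what a tokenizer call returns: integer token IDs plus a 0/1 attention mask. The vocabulary and function name here are made up for illustration.

```python
# Hypothetical toy vocabulary; real models ship vocabularies with
# tens of thousands of entries.
toy_vocab = {"[PAD]": 0, "hello": 1, "world": 2, "tokenizers": 3, "rule": 4}

def toy_tokenize(text, max_length=6):
    # 1) Split the text into tokens (here: lowercase + whitespace split).
    tokens = text.lower().split()
    # 2) Map each token to its integer ID -- these are the "input_ids".
    ids = [toy_vocab[t] for t in tokens]
    # 3) Pad to a fixed length; the attention mask is 1 for real tokens
    #    and 0 for padding, so the model knows to ignore the padding.
    mask = [1] * len(ids) + [0] * (max_length - len(ids))
    ids = ids + [toy_vocab["[PAD]"]] * (max_length - len(ids))
    return {"input_ids": ids, "attention_mask": mask}

print(toy_tokenize("hello world"))
# {'input_ids': [1, 2, 0, 0, 0, 0], 'attention_mask': [1, 1, 0, 0, 0, 0]}
```

Note that nothing here is learned or vector-valued: the IDs are just lookups, and the embedding into vectors only happens later, inside the model.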

Now, if we talk strictly about tokenization, as a pure concept rather than in the context of HF, the input would be text and the output would also be text, split into pieces or tokens. This process is usually done by a rule-based algorithm, also specific to the model, but you can find other tokenization approaches as well, such as statistical tokenization, neural tokenization, etc.
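A minimal rule-based tokenizer in that pure sense might look like the sketch below: text in, a list of text tokens out. The splitting rule here (runs of word characters, with punctuation as separate tokens) is just one possible convention, not the rule any particular model uses.

```python
import re

def rule_tokenize(text):
    # Split into runs of word characters, or single non-space,
    # non-word characters (punctuation becomes its own token).
    return re.findall(r"\w+|[^\w\s]", text)

print(rule_tokenize("Don't panic!"))
# ['Don', "'", 't', 'panic', '!']
```

Real subword tokenizers like BPE or WordPiece go further and split rare words into smaller statistically learned pieces, which is one reason the HF tokenizer has to match the exact model you load.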