My question is: how do we handle accents, tildes and other similar symbols?
Do we have a different token for every combination? What about other languages (Arabic, Japanese, …)? Or even symbols like emojis or “! # ? :)”, etc? Do you just filter them out? Or are there tokens for all of them?
It depends on how the model was trained. If it has been trained on other languages and those symbols, it will recognise them. If not, it will use a probabilistic approach to relate them to a known pattern it has been trained on, or, if it was trained with an unknown token, they will be assigned to that.
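As a minimal sketch of the "assigned to unknown" case (the vocabulary here is hypothetical, not from any real model): a tokenizer with a fixed vocabulary maps any symbol it never saw in training to a single `<unk>` token.

```python
# Minimal sketch of unknown-token fallback (hypothetical toy vocabulary).
vocab = {"hello": 0, "world": 1, "!": 2, "<unk>": 3}

def tokenize(text):
    # Any symbol absent from the training vocabulary falls back to <unk>.
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]

print(tokenize("hello world !"))  # all known symbols
print(tokenize("hello 世界 🙂"))  # unseen symbols map to <unk>
```

Note that with this scheme, different unseen symbols all collapse into the same `<unk>` id, which is why subword and byte-level tokenizers were introduced.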
Can you elaborate on GPT-series for example?
No, I am giving you the generic working principles of language models! I don't know the training dataset of GPT.
How about any example regarding the probabilistic approach for relating them to a known pattern?
Yes, good question. Check the Deep Learning Specialization and the Natural Language Specialization; they will give you insight into these models' internal workings, from scratch!
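As a small illustration of "relating unseen input to known patterns" (a toy greedy longest-match subword splitter with a made-up vocabulary, not an actual trained BPE), an unseen word can be decomposed into known pieces rather than discarded:

```python
# Toy greedy longest-match subword tokenizer (hypothetical vocabulary).
subwords = {"un", "happi", "ness", "h", "a", "p", "i", "n", "e", "s", "u"}

def split_into_subwords(word):
    pieces, i = [], 0
    while i < len(word):
        # Take the longest known subword starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")  # no known piece covers this character
            i += 1
    return pieces

print(split_into_subwords("unhappiness"))  # ['un', 'happi', 'ness']
```

Real tokenizers (BPE, WordPiece, SentencePiece) learn which merges to keep from corpus statistics rather than using a hand-written vocabulary like this one.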
I have gone through those specializations and have an understanding of their internal working mechanisms. Are you referring particularly to the tokenization parts?
Yes, tokenization is the part that deals with unknown symbols! Also check how embeddings are created.
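One concrete way tokenizers avoid unknown symbols entirely (a simplified sketch of the byte-level idea used by GPT-2-style BPE, not any model's actual implementation): every string, accents and emojis included, decomposes into UTF-8 bytes, so a 256-entry base vocabulary covers all possible text. The embedding matrix below is random and only illustrates the lookup step.

```python
import random

# Byte-level fallback: any character is representable as UTF-8 bytes,
# so a 256-entry base vocabulary never needs an <unk> token.
def to_byte_tokens(text):
    return list(text.encode("utf-8"))

print(to_byte_tokens("é"))   # one accented character -> two byte tokens
print(to_byte_tokens("🙂"))  # one emoji -> four byte tokens

# Embedding lookup sketch: each token id indexes a row of a learned matrix
# (here just random numbers standing in for trained weights).
random.seed(0)
EMB_DIM = 4
embedding = [[random.random() for _ in range(EMB_DIM)] for _ in range(256)]

vectors = [embedding[tok] for tok in to_byte_tokens("é")]
print(len(vectors), len(vectors[0]))  # 2 tokens, each a 4-dim vector
```

In practice BPE then merges frequent byte sequences into larger tokens, so common accented words still end up as a single token while rare symbols gracefully fall back to their bytes.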