My question is: how do we handle accents, tildes and other similar symbols?
Do we have a different token for every combination? What about other languages (Arabic, Japanese, …)? Or even symbols like emojis or “! # ? :)”, etc? Do you just filter them out? Or are there tokens for all of them?
It depends on how the model was trained. If it has been trained on other languages and those symbols, it will recognise them. If not, it will use a probabilistic approach to relate them to a known pattern it has been trained on, or, if it was trained with an unknown token, they will be assigned to that.
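As a minimal sketch of the "assigned to unknown" case (the vocabulary here is hypothetical, not from any real model): a tokenizer with a fixed vocabulary maps any symbol it never saw in training to a single `<unk>` token.

```python
# Minimal sketch of unknown-token fallback (hypothetical toy vocabulary).
vocab = {"hello": 0, "world": 1, "!": 2, "<unk>": 3}

def tokenize(text):
    # Any symbol absent from the training vocabulary falls back to <unk>.
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]

print(tokenize("hello world !"))  # all known symbols
print(tokenize("hello 世界 🙂"))  # unseen symbols map to <unk>
```

Note that with this scheme, different unseen symbols all collapse into the same `<unk>` id, which is why subword and byte-level tokenizers were introduced.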
Can you elaborate on GPT-series for example?
No, I am giving you the generic working principles of language models! I don't know the training dataset of GPT.
How about any example regarding the probabilistic approach for relating them to a known pattern?
Yes, good question. Check the Deep Learning Specialization and the Natural Language Specialization; they will give you insight into these models' internal workings, from scratch!
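As a small illustration of "relating unseen input to known patterns" (a toy greedy longest-match subword splitter with a made-up vocabulary, not an actual trained BPE), an unseen word can be decomposed into known pieces rather than discarded:

```python
# Toy greedy longest-match subword tokenizer (hypothetical vocabulary).
subwords = {"un", "happi", "ness", "h", "a", "p", "i", "n", "e", "s", "u"}

def split_into_subwords(word):
    pieces, i = [], 0
    while i < len(word):
        # Take the longest known subword starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")  # no known piece covers this character
            i += 1
    return pieces

print(split_into_subwords("unhappiness"))  # ['un', 'happi', 'ness']
```

Real tokenizers (BPE, WordPiece, SentencePiece) learn which merges to keep from corpus statistics rather than using a hand-written vocabulary like this one.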
I have gone through those specializations and have an understanding of their internal working mechanisms. Are you referring particularly to the tokenization parts?
Yes, tokenization is the part that deals with unknown symbols! Also check how embeddings are created.
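One concrete way tokenizers avoid unknown symbols entirely (a simplified sketch of the byte-level idea used by GPT-2-style BPE, not any model's actual implementation): every string, accents and emojis included, decomposes into UTF-8 bytes, so a 256-entry base vocabulary covers all possible text. The embedding matrix below is random and only illustrates the lookup step.

```python
import random

# Byte-level fallback: any character is representable as UTF-8 bytes,
# so a 256-entry base vocabulary never needs an <unk> token.
def to_byte_tokens(text):
    return list(text.encode("utf-8"))

print(to_byte_tokens("é"))   # one accented character -> two byte tokens
print(to_byte_tokens("🙂"))  # one emoji -> four byte tokens

# Embedding lookup sketch: each token id indexes a row of a learned matrix
# (here just random numbers standing in for trained weights).
random.seed(0)
EMB_DIM = 4
embedding = [[random.random() for _ in range(EMB_DIM)] for _ in range(256)]

vectors = [embedding[tok] for tok in to_byte_tokens("é")]
print(len(vectors), len(vectors[0]))  # 2 tokens, each a 4-dim vector
```

In practice BPE then merges frequent byte sequences into larger tokens, so common accented words still end up as a single token while rare symbols gracefully fall back to their bytes.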