Hi, if I write a prompt in, say, French, does the chatbot first translate it into English and then tokenize it?
No, it won't. The prompt is tokenized as written, with no translation step. If the chatbot was trained on French text, it will recognize the language; otherwise it won't respond, at least not meaningfully.
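To make that concrete, here is a minimal sketch, assuming a toy byte-level tokenizer. The function names `byte_tokenize` and `byte_detokenize` are made up for illustration, and real chatbots use learned subword vocabularies (for example BPE) rather than raw bytes. The point is only that the text goes straight to token IDs in whatever language it was written in.

```python
def byte_tokenize(text: str) -> list[int]:
    """Toy tokenizer (hypothetical): one token ID per UTF-8 byte of the input."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    """Inverse of byte_tokenize: rebuild the original string from byte IDs."""
    return bytes(ids).decode("utf-8")


french = "Où est la bibliothèque ?"
english = "Where is the library?"

# Both prompts are tokenized directly; nothing is translated first.
french_ids = byte_tokenize(french)
english_ids = byte_tokenize(english)

print(len(french), "characters ->", len(french_ids), "tokens (French)")
print(len(english), "characters ->", len(english_ids), "tokens (English)")
```

Decoding the French IDs gives back the original French text, so no English version ever exists inside the pipeline. Whether the model then does anything useful with those tokens depends on how much French it saw during training.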
So ChatGPT’s training dataset must have included Finnish-language material, since I can give ChatGPT prompts in Finnish?
Most models, at least the ones published alongside a paper, will tell you which dataset they were trained on. For example, BLOOM 176B's training data covers 46 languages, as listed on the project's Notion page.
But don’t get too excited: the details show that some of those languages have only a very small amount of data, so the model might not really learn them well.
I have given ChatGPT prompts in Finnish and seen that it hallucinates a lot. Is that because the amount of Finnish training material is relatively small compared to “bigger” languages like English?
Most probably, yes.