How to apply NLP tools like tokenization and embedding on other languages apart from English?

Geetanjali_Srivastav · September 2, 2023, 1:42pm

My question is: how can I apply NLP tools such as tokenization and embeddings to languages other than English?

arvyzukai · September 6, 2023, 8:12am

If you want to use an already trained model like BERT (for transfer learning, fine-tuning, etc.), you have to use the same tokenization procedure the model was trained with.
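To illustrate why the tokenizer has to match, here is a toy sketch. The vocabularies, token strings, and the tiny embedding table are all made up for illustration and are not taken from BERT or any real model. The point is only that a trained embedding table is indexed by the ids its own tokenizer produces, so a different vocabulary selects the wrong rows.

```python
# Hypothetical vocabulary the model was trained with (token -> id).
pretrained_vocab = {"[UNK]": 0, "ev": 1, "##ler": 2, "kitap": 3}
# A different, also hypothetical, vocabulary with the same tokens in another order.
other_vocab = {"[UNK]": 0, "kitap": 1, "ev": 2, "##ler": 3}

# Toy "trained" embedding table: row i belongs to id i of pretrained_vocab.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def to_ids(tokens, vocab):
    """Map tokens to ids, falling back to the unknown-token id."""
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]


tokens = ["ev", "##ler"]  # "evler" ("houses" in Turkish) split into two pieces

right_ids = to_ids(tokens, pretrained_vocab)
wrong_ids = to_ids(tokens, other_vocab)

# Same tokens, different ids, so different embedding rows are looked up.
right_vectors = [embeddings[i] for i in right_ids]
wrong_vectors = [embeddings[i] for i in wrong_ids]

print(right_ids, right_vectors)
print(wrong_ids, wrong_vectors)
```

With the mismatched vocabulary the model would receive vectors it learned for other tokens, which is why a pre-trained model and its tokenizer are used as a pair.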

But if you want to train your own model, you can build your own token vocabulary, specific to your language. There are different ways of doing that, but one of the most popular is byte pair encoding (BPE).
Here is an example of how to do that. And here you can find out more about the impact of tokenization on a Turkish language model.
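To make the BPE idea concrete, here is a minimal pure-Python sketch (not the example linked above). It repeatedly merges the most frequent adjacent pair of symbols in a small word list. The function names, the `</w>` end-of-word marker, and the tiny Turkish word list are assumptions chosen for illustration, and this version works on Unicode characters rather than raw bytes. For real training you would normally use an existing tokenizer library instead.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so runs are deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into subword tokens by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Illustrative corpus only; a real vocabulary needs far more text.
corpus = ["evler", "evlerde", "evlerim", "kitaplar", "kitaplarda", "çiçekler"]
merges = learn_bpe(corpus, num_merges=10)
print(merges)
print(encode("evlerimde", merges))  # a word that is not in the corpus
```

Because the merges are learned from your own text, frequent pieces of your language (for example common suffixes) tend to end up as single tokens, and unseen words still break down into known pieces instead of one unknown token.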

Cheers
