Building a RAG Application for Native Language Documents

I am trying to build a Retrieval-Augmented Generation (RAG) application tailored for documents in my native language. My goal is to develop a robust RAG application that can effectively understand and generate content in my native language. I have a general understanding of the RAG architecture and its potential applications, but I am looking for specific guidance on adapting it to non-English languages.

Here are a few key points I’d appreciate assistance with:

  1. Language-specific challenges: Are there any language-specific challenges or considerations that I should be aware of when implementing RAG for a non-English language?
  2. Training data: What are the best practices for curating training data in a native language? Are there any publicly available datasets or resources that could be particularly helpful?
  3. Fine-tuning models: Are there recommended approaches for fine-tuning pre-trained models like Mistral, LLAMA2, or others for a specific language? Any tips on optimizing model performance?
  4. Evaluation metrics: What evaluation metrics are most suitable for assessing the performance of a RAG model in a non-English context? Are there any language-specific nuances to consider?

Hi @dawanse and @Deepti_Prasad !
Here is what I know and would check:

  1. Use an embedding model that performs well on the language you are building the RAG application for. Check leaderboards like this one on HF. The better the embeddings for your documents, the better the retrieval results will be.
  2. You can check out the open-source datasets on HF as well. For example, see these ‘Nepali’ datasets.
  3. My first approach would be not to fine-tune an LLM on the language, but to build a good retrieval system first and check how the open-source LLMs perform on it. Most of these LLMs are already trained on massive datasets that often include local languages, although I agree they perform worse than in English. You can check out instruction fine-tuning here, and I would recommend this paper and its findings on fine-tuning an LLM for a local language.
  4. I am not well versed in evaluation metrics for RAG systems, but someone shared some resources with me that I have not read thoroughly yet. Check out this work and its codebase for evaluating RAG systems.
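The retrieval idea from point 1 can be sketched in plain Python. Note the `embed` function below is a toy hashing-based stand-in, not a real model; in practice you would replace it with a multilingual embedding model picked from a leaderboard (e.g. one loadable via sentence-transformers). The Nepali example strings are only illustrations:

```python
import math

DIM = 64

def embed(text: str) -> list[float]:
    # Toy hashing-based bag-of-words embedding. Replace with a real
    # multilingual embedding model for actual retrieval quality.
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "काठमाडौं नेपालको राजधानी हो।",        # Nepali: Kathmandu is the capital of Nepal.
    "The stock market closed higher today.",
    "नेपालको राजधानी सहर ठूलो छ।",        # Nepali: Nepal's capital city is big.
]
print(retrieve("नेपालको राजधानी", docs, k=1))
```

The key takeaway is that the retrieval step itself is language-agnostic; what makes or breaks a non-English RAG system is how well the embedding model you plug in understands your language.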

Do share your progress here! I would like to know how it went.
Additional resource: a useful repo to check whether your docs are getting retrieved from the correct search space you want them retrieved from: GitHub - gabrielchua/RAGxplorer: Open-source tool to visualise your RAG 🔮

I didn’t ask for this :face_with_peeking_eye: :thinking:

or did I :face_with_monocle: :nerd_face:

You had liked this post, so I thought you might be interested in this and wanted to notify you :slight_smile:

Oh ok thank you @jyadav202 :slight_smile:

Will surely check. Thanks again.

Regards
DP

I would also mention LAMINI, in case we are talking about fine-tuning LLM models.

I really liked the approach in that one; out of all the LLM short courses I did, I think the LAMINI one was my favourite.

Regards
DP

Hi @jyadav202 . I need your help. I'm interested in building a Q/A bot in my mother language using a mother-language LLM… There are many fine-tuned LLMs available for my mother language, but I have a doubt: do these vector databases accept documents in languages other than English? Also, will RAG work on them?


The vector datastore accepts vectorized text. So as long as you use an embedding model that can vectorize your text, it should be good.
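To make that concrete: the store only ever sees fixed-length float vectors, so the script of the original text never reaches it. A tiny sketch, again using a toy hashing-based `embed` as a stand-in for a real multilingual embedding model (the Telugu sample string is just an illustration):

```python
import math

def embed(text: str, dim: int = 8) -> list[float]:
    # Toy hashing-based embedding: maps any string, in any script, to a
    # fixed-length unit vector. A real system would call a multilingual
    # embedding model here instead.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

english = embed("vector stores only see numbers")
telugu = embed("నమస్కారం ప్రపంచం")  # Telugu text embeds to the same shape
print(len(english), len(telugu))      # both are dim-length float vectors
```

Because both strings come out as same-shaped vectors, any vector database can index and search them identically; language support lives entirely in the embedding model.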

Amazing!!!

Thank you @jyadav202 . Could you help me with that by providing any related resources? I'm making this my final-year project. What's your opinion? My mother language is Telugu!!

This is the list of models on HF pretrained/fine-tuned on languages including Telugu that I could find.

Thank you @jyadav202 . So first I need to embed my whole PDF and store it in a vector database, and then run RAG on top of that, right?
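That flow (extract text → chunk → embed → store → retrieve → prompt the LLM) can be sketched end to end. Everything below is a toy stand-in: `embed` is a hashing trick rather than a real multilingual model, the `document` string stands in for text you would extract from your PDF with a PDF library, and the actual LLM call is omitted:

```python
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a real multilingual embedding model.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 12) -> list[str]:
    # Split the document into fixed-size word chunks before embedding.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# 1. In a real project, extract this text from your PDF with a PDF library;
#    a plain string stands in here.
document = (
    "Telugu is a Dravidian language spoken mainly in Andhra Pradesh and "
    "Telangana. Hyderabad is the capital of Telangana. RAG retrieves "
    "relevant chunks from your documents and passes them to an LLM."
)

# 2. Embed every chunk once and keep (vector, chunk) pairs as the index.
index = [(embed(c), c) for c in chunk(document)]

# 3. At question time, embed the query and take the top-k chunks by cosine.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    scored = sorted(index,
                    key=lambda it: sum(a * b for a, b in zip(q, it[0])),
                    reverse=True)
    return [c for _, c in scored[:k]]

# 4. Build the prompt for the LLM (the actual model call is omitted here).
question = "What is the capital of Telangana?"
context = "\n".join(retrieve(question))
prompt = (f"Answer using only the context below.\n\n"
          f"Context:\n{context}\n\nQuestion: {question}")
print(prompt)
```

For a Telugu PDF the structure is identical; only the embedding model (and the LLM you send the prompt to) needs to handle Telugu.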
