Building a RAG Application for Native Language Documents

dawanse · January 2, 2024, 3:24pm

I am trying to build a Retrieval-Augmented Generation (RAG) application tailored for documents in my native language. My goal is to develop a robust RAG application that can effectively understand and generate content in my native language. I have a general understanding of the RAG architecture and its potential applications, but I am looking for specific guidance on adapting it to non-English languages.

Here are a few key points I’d appreciate assistance with:

Language-specific challenges: Are there any language-specific challenges or considerations that I should be aware of when implementing RAG for a non-English language?
Training data: What are the best practices for curating training data in a native language? Are there any publicly available datasets or resources that could be particularly helpful?
Fine-tuning models: Are there recommended approaches for fine-tuning pre-trained models like Mistral, LLAMA2, or others for a specific language? Any tips on optimizing model performance?
Evaluation metrics: What evaluation metrics are most suitable for assessing the performance of a RAG model in a non-English context? Are there any language-specific nuances to consider?

jyadav202 · January 31, 2024, 6:43am

Hi @dawanse and @Deepti_Prasad !
Here is what I know and would check:

Use embedding models that perform well in the language you are building the RAG application for. Check leaderboards like this one on HF. The better the embeddings for your documents, the better the retrieval results will be.
You can also check out the open-source datasets on HF. For example, check these ‘Nepali’ datasets.
My first approach would be not to fine-tune an LLM on the language, but rather to build a good retrieval system first and check how open-source LLMs perform on it. Most of these LLMs are already trained on massive datasets, which often include local languages, although I agree that they perform poorly compared to English. You can check for instruction fine-tuning here, and I would recommend this paper for its findings on fine-tuning an LLM on a local language.
I am not aware of any evaluation metrics for RAG systems, but someone shared some resources with me that I have not read thoroughly. Check out this work and their codebase for evaluating RAGs.

Do share your progress here! I would like to know how it went.
Additional resource: A useful repo for checking whether your docs are being retrieved from the search space you expect: GitHub - gabrielchua/RAGxplorer: Open-source tool to visualise your RAG 🔮
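
On the evaluation question, one common and language-independent way to check the retrieval half of a RAG system is hit rate at k and mean reciprocal rank (MRR) over a small hand-labelled set of questions. The sketch below is only an illustration and is not taken from the resources linked above; the document ids and relevance labels are made up.

```python
def hit_rate_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries whose top-k results contain at least one relevant doc."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(doc_id in relevant for doc_id in ranked[:k])
    )
    return hits / len(ranked_ids)


def mean_reciprocal_rank(ranked_ids, relevant_ids):
    """Average of 1/rank of the first relevant doc per query (0 if none is found)."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids)


# Hypothetical retriever output for three questions (best match first)
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
# Hypothetical hand-labelled relevant documents for the same questions
relevant = [{"d1"}, {"d2"}, {"d0"}]

print(hit_rate_at_k(ranked, relevant, k=3))
print(mean_reciprocal_rank(ranked, relevant))
```

Because these scores depend only on which documents come back, they can be used to compare two embedding models on the same native-language question set.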

Deepti_Prasad · January 31, 2024, 6:47am

I didn’t ask for this

or did I

jyadav202 · January 31, 2024, 6:48am

You had liked this post, so I thought you might be interested and decided to notify you.

Deepti_Prasad · January 31, 2024, 6:49am

Oh ok thank you @jyadav202

Will surely check. thanks again.

Regards
DP

Deepti_Prasad · January 31, 2024, 6:53am

I would also mention LAMINI when it comes to fine-tuning LLMs.

I really like the approach in this model. Of all the LLM short courses I have taken, I think I liked the LAMINI one best.

Regards
DP

gsvc · March 2, 2024, 6:44am

Hi @jyadav202. I need your help. I'm interested in building a Q&A bot for my mother tongue using an LLM for that language. There are many fine-tuned LLMs available for my mother tongue, but I have a doubt: do these vector databases accept documents in languages other than English? And will RAG work on them?

jyadav202 · March 2, 2024, 7:06am

The vector datastore accepts vectorized text. So as long as you use an embedding model that can vectorize your text, it should be fine.
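
To make the point above concrete, here is a minimal sketch showing that the store only ever sees vectors, so the script of the original text does not matter. `toy_embed` is a hypothetical stand-in (hashed character trigrams) used only so the example runs without downloads; in a real project it would be replaced by an embedding model that supports the target language. The in-memory list standing in for a vector database and the Telugu sentences are also assumptions.

```python
import math
import zlib

DIM = 256  # size of the toy vectors (assumption, chosen arbitrarily)


def toy_embed(text, dim=DIM):
    """Hypothetical stand-in for an embedding model: hashed character trigrams."""
    vec = [0.0] * dim
    padded = f" {text} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # crc32 gives the same bucket on every run, unlike the built-in hash()
        vec[zlib.crc32(trigram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


# The "vector store" here is just a list of (vector, original text) pairs
store = []
for doc in ["తెలుగు ఒక ద్రావిడ భాష", "హైదరాబాద్ తెలంగాణ రాజధాని"]:
    store.append((toy_embed(doc), doc))

# Embed the question the same way and pick the closest stored vector
query_vec = toy_embed("తెలంగాణ రాజధాని ఏది")
best_doc = max(store, key=lambda item: cosine(query_vec, item[0]))[1]
print(best_doc)
```

Retrieval quality then depends on how well the chosen embedding model handles the language, not on the datastore.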

ashispalai215 · March 2, 2024, 7:10am

Amazing!!!

gsvc · March 2, 2024, 7:18am

Thanks @jyadav202. Could you help me by pointing to any related resources? I'm doing this as a final-year project. What's your opinion? My mother tongue is Telugu!

jyadav202 · March 2, 2024, 7:31am

This is the list of models pretrained/finetuned on languages including Telugu on HF that I could find.

gsvc · March 2, 2024, 8:07am

Thanks @jyadav202. So first I need to embed my whole PDF and store it in a vector database, and then I use RAG, right?
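
The flow described in this question can be sketched roughly as: split the extracted text into chunks, embed each chunk, store the vectors, retrieve the closest chunks for a question, and pass them to the LLM in the prompt. The code below is a hedged illustration under several assumptions: the PDF text is already extracted into a string, `toy_embed` is a hypothetical bag-of-words stand-in for a real multilingual embedding model, and the index is a plain list rather than a vector database. All function names are made up for this example.

```python
import math
from collections import Counter


def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: sparse word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(text, size=200, overlap=50, embed=toy_embed):
    """Embed every chunk once and keep (vector, chunk) pairs."""
    return [(embed(chunk), chunk) for chunk in chunk_text(text, size, overlap)]


def retrieve(index, question, k=2, embed=toy_embed):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question, chunks):
    """Assemble the text that would be sent to the LLM."""
    context = "\n---\n".join(chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Placeholder for text extracted from the PDF (assumption)
document_text = "Telugu is a Dravidian language. Hyderabad is the capital of Telangana."
question = "What is the capital of Telangana?"

index = build_index(document_text, size=40, overlap=10)
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Embedding and storing happen once per document; only the retrieval and prompt-building steps run for each question.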
