I am trying to build a robust Retrieval-Augmented Generation (RAG) application for documents in my native language, one that can effectively understand and generate content in that language. I have a general understanding of the RAG architecture and its potential applications, but I am looking for specific guidance on adapting it to non-English languages.
Here are a few key points I’d appreciate assistance with:
Language-specific challenges: Are there any language-specific challenges or considerations that I should be aware of when implementing RAG for a non-English language?
Training data: What are the best practices for curating training data in a native language? Are there any publicly available datasets or resources that could be particularly helpful?
Fine-tuning models: Are there recommended approaches for fine-tuning pre-trained models like Mistral, Llama 2, or others for a specific language? Any tips on optimizing model performance?
Evaluation metrics: What evaluation metrics are most suitable for assessing the performance of a RAG model in a non-English context? Are there any language-specific nuances to consider?
Use an embedding model that performs well in the language you are building the RAG application for. Check leaderboards like this one on HF. The better the embeddings for your documents, the better the retrieval results will be.
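To make the retrieval step concrete, here is a minimal sketch of similarity search over Nepali text. The `embed` function is only a placeholder I made up for illustration (it counts character trigrams); in a real application you would swap in a multilingual embedding model picked from a leaderboard. The point it shows is that the search itself only compares vectors, so nothing in it is specific to English.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Placeholder embedder (an assumption for this sketch): character n-gram counts.
    Replace with a real multilingual embedding model in practice."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=1):
    """Return the indices of the k documents most similar to the query."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(d)), i) for i, d in enumerate(docs)), reverse=True)
    return [i for _, i in scored[:k]]

docs = [
    "काठमाडौं नेपालको राजधानी हो।",            # Kathmandu is the capital of Nepal.
    "सगरमाथा संसारको सबैभन्दा अग्लो हिमाल हो।",  # Sagarmatha is the world's highest mountain.
]
query = "नेपालको राजधानी कुन हो?"               # What is the capital of Nepal?

print(retrieve(query, docs))
```

The trigram embedder only matches surface text, so it will miss paraphrases; that is exactly where a good embedding model for your language makes the difference.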
You can also check out the open-source datasets on HF. For example, check these ‘Nepali’ datasets.
My first approach would be not to fine-tune an LLM on the language, but rather to build a good retrieval system first and check how open-source LLMs perform with it. Most of these LLMs are already trained on massive datasets, which often include local languages, although I agree that they perform poorly compared to English. You can check out instruction fine-tuning here, and I would recommend this paper for its findings on fine-tuning an LLM on a local language.
I am not aware of any evaluation metrics for RAG systems, but someone shared some resources with me that I have not read thoroughly. Check out this work and their codebase for evaluating RAGs.
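Separately from those resources, one language-independent starting point is to score the retrieval step on its own with standard information-retrieval metrics such as recall@k and mean reciprocal rank (MRR). This is only a sketch: the function names and the sample document ids are made up for illustration, it assumes you have a small hand-labelled set of queries with their relevant documents, and it says nothing about the quality of the generated answers.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant document ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=3):
    """Average both metrics over (retrieved_ids, relevant_ids) pairs, one pair per query."""
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(ret, rel, k) for ret, rel in runs) / n,
        "mrr": sum(reciprocal_rank(ret, rel) for ret, rel in runs) / n,
    }

# Hypothetical results for two queries: ranked retrieved ids and the labelled relevant ids.
runs = [
    (["d2", "d1", "d5"], {"d1"}),  # relevant document found at rank 2
    (["d4", "d3", "d9"], {"d7"}),  # relevant document not retrieved
]
print(evaluate(runs, k=3))
```

Because these metrics only compare document ids, they work the same way for any language; the language-specific part is writing realistic queries in your language for the labelled set.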
Hi @jyadav202. I need your help. I'm interested in building a Q&A bot in my mother language using an LLM for that language. There are many fine-tuned LLMs for my mother language, but I have a doubt: do these vector databases accept documents in languages other than English? And will RAG work on them?
Thank you, @jyadav202. Could you help me by providing any resources related to that? I'm doing this as a final-year project. What's your opinion? My mother language is Telugu!