A RAG-based chatbot is a system where we take some information that an LLM doesn’t know the specifics of, chunk it up, and put it into a vector database. Then, when you ask a chatbot using this vector database a question, it looks for items in the database similar to your question and uses them as sources to generate an answer for you.
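To make that concrete, here’s a rough sketch of the retrieval step (using sentence-transformers embeddings and plain cosine similarity in place of a real vector database; the chunks and model name are just illustrative):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative chunks; in practice these come from splitting your own documents.
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "Premium accounts include priority shipping.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

# "Indexing": embed every chunk once and keep the vectors around.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (cosine similarity)."""
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec  # dot product of normalized vectors = cosine
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# The retrieved chunks get pasted into the LLM prompt as sources.
print(retrieve("Can I get my money back?"))
```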
I felt it was necessary to give that preface because otherwise people might think I don’t really have an understanding of RAG.
My question: my issue is that I don’t want to have to feed the new info from RAG into the prompt every time to get an answer. Is there any way I could get the LLM to see the new information once and then use the LLM unlimited times? What I mean is, let’s say I want to start a conversation with this chatbot. I do not want to have to send all the new information via the prompt again every time I create a new session with the chatbot and ask some question. Could I just have it know all the info once and then that’s it?
As far as I can tell, the answer is that it’s not possible, but I’m asking here because I want it to be possible.
If this “new info” is static information that doesn’t get updated frequently, then you can try to fine-tune the LLM on it. For example, if you want to add support for some special medical knowledge to the LLM (and that medical knowledge is not likely to change), then you can fine-tune the LLM with that knowledge. Once the LLM is fine-tuned (the information is “seen once”), it should retain the medical knowledge without you needing to include it in the prompt every time.
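To illustrate, a bare-bones causal-LM fine-tuning run with the Hugging Face transformers library might look like the sketch below (the model name, data file, and hyperparameters are placeholders, not recommendations):

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "gpt2"  # placeholder; substitute the model you actually use
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# "medical_notes.txt" is a hypothetical file holding the static domain text.
dataset = load_dataset("text", data_files={"train": "medical_notes.txt"})

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=512,
                    padding="max_length")
    out["labels"] = out["input_ids"].copy()  # causal LM: predict the input itself
    return out

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
)
trainer.train()                   # after this, the knowledge lives in the weights,
model.save_pretrained("ft-out")   # so it no longer needs to go in the prompt
```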
If this “new info” is dynamic information that can be updated frequently, such as your past conversations/history with a particular chatbot, then it won’t really make sense for you to fine-tune it. In that case, you will need to use something like RAG, and effectively save your chat history as external information for your chatbot.
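As a rough sketch of that idea, each finished conversation turn could be embedded and appended to a store that future sessions retrieve from (sentence-transformers is used here purely for illustration, and the in-memory lists stand in for a real vector database):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
history_texts: list[str] = []         # past turns, stored as plain text
history_vecs: list[np.ndarray] = []   # their embeddings: the "vector DB"

def remember(turn: str) -> None:
    """Save a finished chat turn so later sessions can retrieve it."""
    history_texts.append(turn)
    history_vecs.append(model.encode(turn, normalize_embeddings=True))

def recall(query: str, k: int = 3) -> list[str]:
    """Fetch the k most relevant past turns to prepend to a new prompt."""
    if not history_texts:
        return []
    q = model.encode(query, normalize_embeddings=True)
    scores = np.stack(history_vecs) @ q
    return [history_texts[i] for i in np.argsort(scores)[::-1][:k]]

remember("User prefers metric units and lives in Berlin.")
print(recall("What units should I use for the weather report?"))
```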
ChatGPT has recently launched something similar called Memory. The idea is that the chatbot is able to memorize certain details about the user. I don’t work for them and don’t know how it’s actually implemented, but my guess is that it has a way (likely another neural network) that identifies user preferences, and those user preferences are saved externally (as in the RAG approach) and included as context in any future conversations with the chatbot.
I’d like to clarify: the objective is to reduce costs as much as possible. With that in mind, it’s obviously quite expensive to fine-tune an LLM, I think. Are there any other alternatives?
I don’t think fine-tuning necessarily costs more; it really depends on your situation.
Which LLM are you using and how big is it? Are you running your own LLM or using some API? How many computers do you have and how powerful are they? How much data are you trying to fine-tune on (and do you already have this data)?
If you already have local machines that run your model (and they’re not doing anything else), then it might be OK to fine-tune over several days until the model is ready.
Thank you so much for the advice. I went back and ran some $$ numbers, and it does look much more worth it to fine-tune (probably) than to do some janky RAG thing. One question: do you (or anyone else) know the performance difference between QLoRA and LoRA generally? I know that in quantization we lose a bit of precision, so technically would that mean a loss in recall/accuracy/precision and other KPIs? If so, is there any such loss when comparing QLoRA to LoRA?
Sorry if this question is…dumb lol. I don’t totally understand quantization, so I may have some misconceptions.
Yes, there will likely be some degradation/loss in recall/accuracy/precision when you apply quantization to the model (and even when you use LoRA rather than fine-tuning all the params). Unfortunately, we won’t know how much until you try it out. It also depends on the hyperparameters you use for LoRA.
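If it helps, the practical difference is small in code: QLoRA is essentially LoRA applied on top of a 4-bit-quantized base model, which is where the extra precision loss can come from. Here’s a sketch using the peft and bitsandbytes libraries (the model name and LoRA hyperparameters are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; use your own model

# LoRA hyperparameters: r, alpha, and target modules all affect quality.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")

# Plain LoRA: half-precision frozen base weights plus small trainable adapters.
base = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
lora_model = get_peft_model(base, lora_cfg)

# QLoRA: the same adapters, but the frozen base is loaded in 4-bit NF4.
bnb_cfg = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)
qbase = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_cfg)
qbase = prepare_model_for_kbit_training(qbase)
qlora_model = get_peft_model(qbase, lora_cfg)

qlora_model.print_trainable_parameters()  # adapters only; the base stays frozen
```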
If computing budget is a concern, I’d recommend you experiment first with a smaller training set for fine-tuning (maybe 10% of all your fine-tuning data, depending on how much data you have). It should be less expensive, but at least you can start to see whether it works, how well it works, and test out a few hyperparameters. Once you’re pretty comfortable with it, you can then commit to training all your data to get a better model.
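Concretely, with the datasets library, carving out that pilot subset is a one-liner (the file name is hypothetical and the 10% figure is just the suggestion above):

```python
from datasets import load_dataset

# Hypothetical fine-tuning file; substitute your own data.
full = load_dataset("json", data_files="finetune_data.jsonl")["train"]

# Shuffle first so the pilot subset is representative, then take ~10%.
pilot = full.shuffle(seed=42).select(range(int(0.1 * len(full))))
print(len(full), len(pilot))
```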