LLMs have a knowledge cut-off problem. However, assuming there is no need to change the architecture of the LLM, why not apply new data and continue to train the model progressively? Say a GPT model has been trained on data up to 2022: this means either the training process converged or the number of steps reached a threshold. With new data we can expect the training error to be greater than before, but continuing the training process should bring it back down. The lecture makes it sound like we need to retrain the model from scratch, which I cannot understand.
Could you give the video timestamp where this doubt arose for you?
In the video “Using the LLM in applications” of W3 (also see slide 90), at 2’58", it says:
RAG is a great way to overcome the knowledge cutoff issue and help the model update its understanding of the world. While you could retrain the model on new data, this would quickly become very expensive. And require repeated retraining to regularly update the model with new knowledge.
What does retraining mean here? Does it mean retraining the model from scratch, resetting the parameters to random values, or continuing to train the model from its current parameters on the new data? If the latter, the cost should be only incremental.
The RAG method works, but if similar queries are issued by different users again and again, the same retrieved data becomes part of the input to the LLM every time, wasting inference resources. Furthermore, it may not work if the new data needed is too big to fit in the context window.
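As I understand it, each RAG query looks roughly like the sketch below (the document store, the TF-IDF retriever, and the prompt template are all invented for illustration; a real system would use a learned embedding model and a vector database):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical store of "new" post-cutoff documents.
documents = [
    "In 2023 the Acme-3 model was released with a 128k context window.",
    "The 2023 budget report shows a 12% increase in R&D spending.",
    "Acme-3 supports tool use via a JSON function-calling interface.",
]

# Index the documents once, offline.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    """Prepend the retrieved context to the user query -- the core of RAG."""
    context = "\n".join(retrieve(query))
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("What context window does Acme-3 have?"))
```

The retrieved context is re-sent with every single request, which is exactly the repeated inference cost I am referring to.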
That makes sense. This is a version of what is called “Transfer Learning”, and you don’t have to start from scratch: you can also start from the current trained weights and then do incremental training on the new data, or perhaps on the augmented corpus (the original training data plus the new incremental data). As they say, doing that is a lot less expensive in terms of time and compute than training again from scratch.
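In code, the difference between the two readings of “retrain” is roughly this (a sketch using the Hugging Face transformers library, with gpt2 purely as a stand-in checkpoint):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Reading 1: retrain from scratch -- same architecture, but the parameters
# are reset to random values, so all prior training compute is thrown away.
config = AutoConfig.from_pretrained("gpt2")
model_from_scratch = AutoModelForCausalLM.from_config(config)

# Reading 2: incremental (continued) training -- load the already-trained
# weights and keep optimizing them, typically on just the new data.
model_continued = AutoModelForCausalLM.from_pretrained("gpt2")
```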
So I think this just shows that your original question was based on an incomplete interpretation of what was said. You could retrain from scratch, but you don’t have to: they specifically also described the option of incremental training and said that was likely to be more efficient.
But the course is apparently also teaching you another technique called RAG, which I do not know anything about.
Thanks for the explanation. From the lecture, RAG does not update the LLM with the new data; instead, the relevant information extracted from the new data is injected into the prompt. ChatGPT’s underlying LLM is updated roughly yearly, when a new model is built. I am not sure whether transfer learning can be used when we add new data progressively, but there is at least one option:
If the LLM architecture does not need to change, why not keep the parameters trained on the current data set, add the new data to that set, and then continue the same training process daily, or at least weekly, to incrementally update the LLM over time? Although all the data would be used in each training epoch, the weights are already near optimal, so convergence should be fast. Is this method still very expensive, or does the transformer architecture not allow incremental updates in this way?
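Something like this is what I have in mind for one update round (a sketch only; the checkpoint name, the corpus, and the hyperparameters are placeholders):

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Load the checkpoint produced by the previous update round
# ("gpt2" is a placeholder for the production model).
checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Placeholder corpus: this week's new documents (optionally mixed with a
# sample of the original data, as suggested above).
new_texts = ["A document collected this week ...", "Another new document ..."]
dataset = Dataset.from_dict({"text": new_texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ckpt-this-week",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        learning_rate=1e-5,  # small step size, since the weights are near optimal
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("ckpt-this-week")  # the starting point for the next round
```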
I don’t know how to evaluate the relative cost of that method, but any neural network architecture should in principle allow this type of incremental training. There should be nothing about the transformer architecture that makes this impossible. Whether it is really practical is a separate question.
This is a very interesting issue. Early search engines rebuilt the search index weekly; later this became daily, and even near-instant for news and social media. But it looks like ChatGPT cannot do this currently. Maybe daily or even weekly updates are still too expensive to run continuously. I hope better methods emerge to solve this.