I’m unclear on why the reward model is used for reinforcement learning instead of supervised learning. Couldn’t a huge set of completions be generated, scored by the reward model, and then used to retrain the model in a supervised fashion? I notice that Constitutional AI partly uses this approach, even if it goes back to RL in the second phase.
I’m assuming that a few epochs of supervised learning would be much less resource-intensive than the many iterations of retraining needed in RL.
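To make the idea concrete, here is a rough sketch of what I have in mind; `policy.generate` and `reward_model.score` are just placeholder names for whatever generation and scoring interfaces are actually used, not a real API:

```python
# Sketch of the supervised alternative: sample many completions, score them
# with the reward model, and keep the highest-scoring ones as training targets.
# All method names below are hypothetical placeholders.

def build_supervised_dataset(policy, reward_model, prompts, n_samples=8):
    dataset = []
    for prompt in prompts:
        # Sample several candidate completions from the current model.
        candidates = [policy.generate(prompt) for _ in range(n_samples)]
        # Score each candidate with the reward model.
        scored = [(reward_model.score(prompt, c), c) for c in candidates]
        # Keep the highest-scoring completion as a pseudo-label.
        _, best_completion = max(scored, key=lambda s: s[0])
        dataset.append((prompt, best_completion))
    return dataset
```

The resulting (prompt, completion) pairs could then be used for ordinary supervised fine-tuning, which I believe is essentially what best-of-n / rejection-sampling fine-tuning does.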
I would like to refer you to my answer to your previous question, as I think the answer is similar:
The Constitutional AI approach is one more step in the process of making LLMs better, in this case by making sure that the model is aligned with human interests and needs. Still, we face the same problem of scale, so the solution has to be one that stays within practical boundaries. I expect that over time we will develop better components to achieve these goals. We are just in the infancy of this technology.
My question is related to the choice of algorithm. As far as I know, RL is typically used for multi-step environments where the agent needs to find a way to maximise the “value”. But vanilla RL updates the model every iteration and requires many thousands of iterations for even a simple environment. That’s not a big issue when a model update just tweaks the expected values of different actions in some states. When a model update requires hours, as for LLMs, that seems unmanageable. I assume one could batch the updates so they are only done every few thousand tests, as there are plenty of independent (uncorrelated) prompts to evaluate.
That’s the thinking behind this question. Why not add the reward signal to the original accuracy loss function, or do a second parameter-optimisation stage in the training phase? The choice of RL is not obvious to me. The need to align with human values is clear, and I can accept that doing it completely by hand is infeasible.
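Here is roughly what I imagine the batched version would look like (my own sketch; `policy`, `reward_model` and `ppo_update` are hypothetical placeholders for the actual generation, scoring and update routines):

```python
# One outer step: many generations first, then a single batched policy update.

def batched_rl_step(policy, reward_model, ppo_update, prompts):
    # 1. Roll out completions for a whole batch of independent prompts,
    #    without touching the policy weights yet.
    completions = [policy.generate(p) for p in prompts]
    # 2. Score every completion with the frozen reward model.
    rewards = [reward_model.score(p, c) for p, c in zip(prompts, completions)]
    # 3. Only now perform the expensive gradient update, once per batch
    #    (PPO typically runs a few minibatch epochs over this same batch).
    ppo_update(policy, prompts, completions, rewards)
```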
It sounds like you are questioning the use of RL in situations where model updates take a long time, such as with LLMs. It’s true that vanilla RL can require many iterations to converge, which can be time-consuming. However, RL can still be a useful approach for multi-step environments where the agent needs to maximize value.
One potential solution to the issue of slow model updates could be to batch the updates so that they are only done every few thousand tests. Another option could be to fold the accuracy loss into the RL objective, or to add a second parameter optimization stage in the training phase.
Ultimately, the choice of RL as an approach will depend on the specific problem at hand and the resources available. While there may be other approaches that could be more efficient for certain situations, RL can still be a valuable tool in many cases.
Thanks. I guess in the end my thought of using supervised learning ends up being RL, since it’s generate completions → calculate rewards → update LLM → generate completions → calculate rewards, and so on.
I’m thinking that labels exist indirectly, since the reward can be generated immediately when a completion is generated. This is of course only if a reward model is used, not if it’s done manually.
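To illustrate what I mean by the score acting as a continuous “label”: as I understand it, the scalar reward is not a per-token target for cross-entropy, but it can weight the log-likelihood of the sampled completion, which is the simplest policy-gradient (REINFORCE-style) form. A toy sketch with made-up numbers, not the course code:

```python
import torch

def reinforce_loss(token_logprobs, reward, baseline=0.0):
    """token_logprobs: per-token log-probabilities of the sampled completion.
    reward: single scalar score from the reward model for the whole completion."""
    advantage = reward - baseline        # centre the reward to reduce variance
    # Maximise the reward-weighted log-likelihood -> minimise its negative.
    return -(advantage * token_logprobs.sum())

# Dummy values for a three-token completion scored 0.8 by the reward model.
token_logprobs = torch.tensor([-0.5, -1.2, -0.3], requires_grad=True)
loss = reinforce_loss(token_logprobs, reward=0.8, baseline=0.5)
loss.backward()
```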
I think I got a better understanding of this after watching the PPO video and working through the lab. I didn’t really get that the path was the sequence of tokens in the completion.
Above I was mixing my words a bit and used label when I meant score/cost. My thought is that the reward model generates a score and that can be used as a continuous “label” for the training.
To clarify my thoughts on supervised learning, I again used the wrong term. I was actually thinking of pure optimisation, where we would be maximising helpfulness, non-hatefulness, etc.
How does the RL algorithm work here? In an ALFWorld example or some other training environment, each potential next token would get an expectation value that gets updated as we progress. But here the algorithm basically gets a complete path and awards it a score. Doesn’t this seem more like optimisation than RL? What am I missing?
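In case it helps to make the connection concrete, here is how I currently picture PPO being applied to a completion: the “path” is the token sequence, each token is one action, and the reward-model score feeds into per-token advantages (in practice via a value head / GAE, which I’m just hard-coding here). This is my own toy sketch, not the lab code:

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Per-token probability ratio between the updated policy and the policy
    # that originally generated the completion.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO takes the pessimistic (element-wise minimum) of the two terms,
    # then averages over the tokens of the completion.
    return -torch.min(unclipped, clipped).mean()

# Dummy numbers for a four-token completion whose reward-model score produced
# a positive advantage on every token.
old_lp = torch.tensor([-1.0, -0.7, -1.3, -0.9])
new_lp = torch.tensor([-0.9, -0.8, -1.2, -0.9], requires_grad=True)
adv = torch.tensor([0.4, 0.4, 0.4, 0.4])
loss = ppo_clipped_loss(new_lp, old_lp, adv)
loss.backward()
```

With real models, new_lp would come from a forward pass of the policy being trained and old_lp from a frozen copy of the policy that generated the completion.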