Why build a reward model in RLHF?

I have a hard time understanding the concept of a reward model in RLHF. The theory says that the initial model receives scalar feedback from a reward model, aligning it with the preferences encoded in that already-trained reward model. Training the reward model itself requires a substantial number of human-annotated examples.
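To make the setup concrete, here is a minimal sketch of how a reward model is typically trained on those human annotations: annotators rank pairs of responses, and the model is fit with the pairwise Bradley-Terry loss `-log sigmoid(r(chosen) - r(rejected))`. The linear model and synthetic "embeddings" below are illustrative stand-ins; a real reward model uses an LLM backbone with a scalar head, but the objective is the same.

```python
import numpy as np

# Sketch: a linear reward model r(x) = w . x trained on pairwise human
# preferences with the Bradley-Terry / logistic loss
#   L = -log sigmoid(r(chosen) - r(rejected)).
# Feature vectors stand in for response embeddings (an assumption for
# illustration); `true_w` plays the role of the annotators' latent preference.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dim, n = 8, 200
true_w = rng.normal(size=dim)                  # hypothetical "true" preference direction
responses_a = rng.normal(size=(n, dim))
responses_b = rng.normal(size=(n, dim))

# The annotator prefers whichever response scores higher under true_w.
prefer_a = (responses_a @ true_w) > (responses_b @ true_w)
chosen = np.where(prefer_a[:, None], responses_a, responses_b)
rejected = np.where(prefer_a[:, None], responses_b, responses_a)

w = np.zeros(dim)
lr = 0.5
for _ in range(200):
    diff = chosen - rejected                        # (n, dim)
    p = sigmoid(diff @ w)                           # P(chosen preferred under w)
    grad = -(diff * (1 - p)[:, None]).mean(axis=0)  # gradient of the mean loss
    w -= lr * grad

# Fraction of pairs where the trained reward model agrees with the annotator.
accuracy = float(np.mean((chosen @ w) > (rejected @ w)))
```

Once trained, `r(x)` replaces the annotator during RL fine-tuning, providing the scalar feedback signal at every step without further human involvement.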

No matter how good the reward model is, its accuracy will always be lower than actual human feedback. Also, since the reward model typically needs to be comparable in size to the initial model to work effectively, running a good reward model is itself expensive.

Is there some minimum baseline accuracy a reward model should have to be considered useful? How quickly does the initial LLM degrade as the reward model's accuracy decreases?

Is it possible that collecting more human feedback and plugging it directly into the initial model might actually be less expensive than building a reward model?

Good question!

Regarding baseline accuracy, I think it should depend on the domain/use case and the risks associated with the model behaving incorrectly. It is also fair to assume that as the reward model's accuracy decreases, the quality of the feedback provided to the initial model decreases too, potentially leading to suboptimal learning.
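A toy experiment makes this degradation tangible: rerank candidate responses with a reward model whose scores are corrupted by noise, and watch the true quality of the selected response fall as the noise grows. The Gaussian noise model and the numbers here are illustrative assumptions, not results from the literature.

```python
import numpy as np

# Best-of-n selection under a noisy reward model. Each candidate has a latent
# "true" human quality score; the reward model observes it plus Gaussian noise
# (an assumed noise model for illustration) and picks the apparent best.

rng = np.random.default_rng(1)

def best_of_n_quality(noise_std, n=8, trials=5000):
    true_scores = rng.normal(size=(trials, n))           # latent human quality
    rm_scores = true_scores + noise_std * rng.normal(size=(trials, n))
    picked = rm_scores.argmax(axis=1)                    # response the RM selects
    return float(true_scores[np.arange(trials), picked].mean())

quality_good_rm = best_of_n_quality(noise_std=0.1)   # accurate reward model
quality_poor_rm = best_of_n_quality(noise_std=2.0)   # inaccurate reward model
```

With low noise the selected responses sit near the true maximum; with high noise the selection is close to random, so the average true quality drops sharply. The same intuition carries over to RL fine-tuning, where the policy optimizes against the noisy signal directly.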

Regarding your last paragraph, direct human feedback could indeed provide more accurate guidance to the initial model, but its feasibility is constrained by factors like the availability of human annotators, the scale of the data, the time sensitivity of the task, etc. A hybrid approach may be worth recommending: initial training or fine-tuning with a reward model, followed by iterative refinement based on direct human feedback.
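One way to picture such a hybrid loop: the policy takes many cheap optimization steps against the reward model, while fresh human labels are collected only occasionally to refresh it. Everything below is a hypothetical skeleton with stub functions, purely to show the control flow, not a real RLHF implementation.

```python
# Schematic hybrid loop: frequent RL steps against a learned reward model,
# infrequent rounds of direct human feedback. All functions are hypothetical
# stubs; state is tracked in plain dicts for illustration.

def collect_human_labels(prompts):
    # Stand-in: in practice annotators rank model responses per prompt.
    return [(p, "chosen", "rejected") for p in prompts]

def update_reward_model(reward_model, labels):
    # Stand-in for retraining the reward model on the new preference pairs.
    reward_model["num_labels"] += len(labels)
    return reward_model

def rl_finetune_step(policy, reward_model):
    # Stand-in for one PPO-style update using the reward model's scalar signal.
    policy["steps"] += 1
    return policy

policy = {"steps": 0}
reward_model = {"num_labels": 0}
prompts = [f"prompt_{i}" for i in range(10)]

for step in range(100):
    policy = rl_finetune_step(policy, reward_model)  # cheap: scalar reward from RM
    if step % 25 == 0:                               # expensive: occasional human pass
        labels = collect_human_labels(prompts)
        reward_model = update_reward_model(reward_model, labels)
```

The ratio of cheap RM steps to expensive human rounds (here 25:1, chosen arbitrarily) is exactly the cost knob the question is asking about.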