I have a hard time understanding the concept of a reward model in RLHF. The theory says that the initial model is given scalar feedback from a reward model in order to align it with the policies encoded in that already-trained reward model. To train the reward model itself, a decent number of human-annotated examples (typically preference comparisons between candidate responses) is needed.
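To check my understanding of how the reward model itself is trained, here is a minimal sketch of the pairwise (Bradley-Terry style) objective I have in mind, written against a generic Hugging Face-style classifier. The model name, prompts, and hyperparameters are just placeholders, not anyone's actual setup:

```python
# Minimal sketch of reward-model training on pairwise human preferences.
# Model choice, prompts, and learning rate are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "gpt2"  # hypothetical backbone; in practice a much larger model is used

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1  # a single scalar reward per sequence
)
reward_model.config.pad_token_id = tokenizer.pad_token_id

optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

def preference_loss(chosen_texts, rejected_texts):
    """Bradley-Terry loss: the human-preferred completion should score higher."""
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = reward_model(**chosen).logits.squeeze(-1)      # scalar reward per example
    r_rejected = reward_model(**rejected).logits.squeeze(-1)
    # -log sigmoid(r_chosen - r_rejected): pushes chosen rewards above rejected ones
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# One hypothetical training step on a single annotated preference pair
loss = preference_loss(
    ["Prompt: ... Answer: a helpful, accurate reply"],
    ["Prompt: ... Answer: an unhelpful or unsafe reply"],
)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

During RLHF proper, this reward model then scores the initial model's generations, and that scalar score is what the policy optimization step (e.g. PPO) maximizes.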
No matter how good the reward model is, its accuracy will always be lower than that of actual human feedback. Also, since the reward model needs to be comparable in size to the initial model to work effectively, running a good reward model is itself going to be very expensive.
Is there some kind of minimum baseline accuracy that a reward model should have for it to be considered useful? How quickly does the initial LLM degrade as the reward model's accuracy decreases?
Is it possible that collecting more human-generated feedback and feeding it directly to the initial model might actually be less expensive than building a reward model?