The magic reward model?

The reward model seems like a bit of magic to me. One of the criteria for a good model is honesty. How could a comparatively simple reward model be a better judge of what is honest than the much larger LLM?

Even with tens of thousands of human inputs used to create the reward model, I can’t see how it would have enough information to know that. And how could it generalise those assessments to other prompts, given that the completions mostly seem to make sense?

One possible answer to your question:

Yes, it could be better to have humans assign the rewards instead of a “simple reward model”. However, the cost and logistics of having enough humans to do this for a very large LLM are not practical.

For this reason, the reward model is a very good proxy. It does the job well, and it can be created, re-used, and scaled at reasonable cost.

As someone said a long time ago, “Perfect is the enemy of good”. In this case, perfect may not be viable, so we have a very good solution.



It is mentioned in the course that smaller models can be suitable for “narrow” tasks. In this case we only need a single result from the reward model, as opposed to the ability to do possibly multiple complex tasks well (as for the LLM). As less complex models need less data to train in general, using simpler models for the reward could be a practical trade-off.

I definitely see the point of transferring the bulk of the work to a model instead of setting the rewards manually. I just have a hard time accepting that the LLM is based on billions of real entries, while the reward model is based on far fewer and is still able to judge how well the LLM is performing. Couldn’t it easily be fooled? To some extent this is the reward hacking that is mentioned, but it could be more subtle.

I have heard that a reward model could be based on a pretrained LLM. Would that be more acceptable?

Not sure if that sets my mind at ease. It’s kind of the blind guiding the blind, especially since I assume the model being trained is supposed to be better than anything else available. This would not be the case when reducing the model complexity for inference, but I guess alignment is done on the full model rather than a reduced model.

Why is it the blind guiding the blind? More specifically, why is the reward model blind? The reward model is LLM based and is trained on the human feedback data.

Did you come across any reward model that was evaluated to be performing very poorly on the human feedback data, and still someone used it? If so, how poor was that evaluation result? If not, how do you justify your worry?
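One way to check this in practice is to score the reward model on held-out human comparisons. Below is a minimal sketch, assuming the feedback is stored as (preferred, rejected) pairs. `toy_reward`, `pairwise_accuracy`, and the example pairs are all hypothetical and made up for illustration; a real reward model would be a trained network, not a word list.

```python
def toy_reward(completion: str) -> float:
    """Hypothetical stand-in for a trained reward model: penalizes rude words."""
    rude_words = {"stupid", "idiot"}
    words = completion.lower().split()
    return -float(sum(w.strip(".,!?") in rude_words for w in words))


def pairwise_accuracy(reward_fn, pairs):
    """Fraction of held-out pairs where the human-preferred completion scores higher."""
    correct = sum(reward_fn(chosen) > reward_fn(rejected) for chosen, rejected in pairs)
    return correct / len(pairs)


# Made-up held-out comparisons: (human-preferred, human-rejected).
held_out_pairs = [
    ("Happy to help with that.", "That is a stupid question."),
    ("Here is one way to do it.", "Figure it out, idiot."),
    ("Sure, here is a summary.", "No."),  # the toy model cannot separate this pair
]

accuracy = pairwise_accuracy(toy_reward, held_out_pairs)
print(f"agreement with human labels: {accuracy:.2f}")
```

A low agreement score on pairs the model never saw during training would be concrete evidence that the reward model is "blind"; a high one would suggest it has picked up the human preferences reasonably well, at least on data similar to the held-out set.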


The first thing to note is that the LLM’s objective is to model $P(x_t \mid x_{t-1}, \cdots, x_0)$, where the probability of the next token is over the entire vocabulary. This is much more complex than the reward model, which models $P(y \mid x_0, \cdots, x_T)$, where $y$ is binary. The former is much, much harder to solve than the latter, and therefore requires millions of examples to train.
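To make the difference in output space concrete, here is a minimal sketch with made-up logits and a toy vocabulary size. All names and numbers are illustrative assumptions, not taken from the course. The LLM head has to produce a full distribution over the vocabulary at every step, while a binary reward model produces a single probability for the whole sequence.

```python
import math


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy size; a real LLM vocabulary has tens of thousands of tokens.
VOCAB_SIZE = 8

# LLM head: one logit per vocabulary token, at every position t.
next_token_logits = [0.1 * i for i in range(VOCAB_SIZE)]
next_token_probs = softmax(next_token_logits)  # P(x_t | x_{t-1}, ..., x_0)

# Reward model head: a single logit for the whole sequence (made-up value).
reward_logit = 1.5
p_positive = sigmoid(reward_logit)  # P(y = 1 | x_0, ..., x_T)

print(len(next_token_probs), "probabilities per step for the LLM")
print("1 probability per sequence for the reward model:", round(p_positive, 3))
```

The reward model only has to get one number right per sequence, which is part of why it can be trained with far less labelled data than the LLM itself.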

Any deep learning model can be fooled, no matter how complex it is. There is a separate branch of study on it.

As @Juan_Olano pointed out, all these are just (economically) viable solutions (or techniques) that are far from perfect. Deep learning is a game of gradients, still a black box.