The magic reward model?

One possible answer to your question:

Yes, it could be better to have humans provide the rewards directly, instead of a “simple reward model”. However, the cost and logistics of recruiting the number of humans needed to do this at the scale of a very large LLM make it impractical.

For this reason, the reward model is a very good proxy for human judgment. It can be created once from human preference data, then reused and scaled to score millions of responses at reasonable cost.
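To make the “proxy” idea concrete, here’s a toy sketch (plain PyTorch, all names and dimensions hypothetical) of the usual way a reward model is trained: humans label preference pairs once, the model learns to score the preferred response higher (the standard Bradley-Terry pairwise objective), and after that the scalar scorer can be reused indefinitely:

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Maps a pooled (prompt, response) embedding to one scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # one scalar: "how good is this response?"
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(embedding).squeeze(-1)

def pairwise_loss(reward_chosen, reward_rejected):
    # Bradley-Terry: maximize P(chosen > rejected) = sigmoid(r_c - r_r)
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Dummy embeddings standing in for encoded (prompt, response) pairs
# from a human-labeled preference dataset.
chosen = torch.randn(8, 128)    # responses humans preferred
rejected = torch.randn(8, 128)  # responses humans rejected

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    loss = pairwise_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Once trained, model(embedding) returns a scalar reward that can be
# queried millions of times during RL fine-tuning at negligible cost,
# whereas asking a human for each score would be prohibitively expensive.
```

This is exactly the trade-off in question: the expensive human effort happens once, up front, to build the labeled pairs; everything after that is cheap model inference.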

As the old saying goes, “Perfect is the enemy of good.” In this case, perfect (a human scoring every response) may not be viable, so a very good approximation is the practical solution.

Thoughts?
