One possible answer to your question:
Yes, it could be better to have humans assign the rewards directly instead of a "simple reward model". However, the cost and logistics of recruiting enough human annotators to score every response generated during training of a large LLM are simply not practical.
For this reason, the reward model is a very good proxy: it is trained on human preference data to approximate those judgments, and once trained it can be reused and scaled to score huge numbers of responses at a reasonable cost.
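For concreteness, here is a minimal sketch of what "scoring with a reward model" looks like in practice. This is just an illustration, not part of any specific training pipeline; the checkpoint name is one publicly available preference-trained reward model, and the prompt/responses are made up:

```python
# Minimal sketch: use a trained reward model as a cheap proxy for
# human preference judgments. Higher score = more preferred response.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Example public reward model checkpoint (swap in whatever you use).
model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

prompt = "How do I sort a list in Python?"
responses = [
    "Use the built-in sorted() function, e.g. sorted(my_list).",
    "Figure it out yourself.",
]

# Score each (prompt, response) pair without tracking gradients.
with torch.no_grad():
    for response in responses:
        inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
        score = model(**inputs).logits[0].item()
        print(f"{score:+.3f}  {response}")
```

A human could obviously rank these two responses, but the point is that the model can do it millions of times, around the clock, for the price of GPU time.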
As the old saying goes, "perfect is the enemy of good". In this case, the perfect solution (direct human feedback on every response) is not viable, so we settle for a very good one.
Thoughts?