I have a hard time understanding the concept of a reward model in RLHF. The theory says that the initial model is given scalar feedback from a reward model in order to align it with the policies encoded in that already-trained reward model. To train the reward model itself, a decent number of human-annotated examples (typically preference comparisons between candidate responses) is needed.
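To check my understanding of how the reward model itself is trained, here is a minimal sketch of the pairwise (Bradley-Terry style) objective I have in mind, written against a generic Hugging Face-style classifier. The model name, prompts, and hyperparameters are just placeholders, not anyone's actual setup:

```python
# Minimal sketch of reward-model training on pairwise human preferences.
# Model choice, prompts, and learning rate are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "gpt2"  # hypothetical backbone; in practice a much larger model is used

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1  # a single scalar reward per sequence
)
reward_model.config.pad_token_id = tokenizer.pad_token_id

optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

def preference_loss(chosen_texts, rejected_texts):
    """Bradley-Terry loss: the human-preferred completion should score higher."""
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = reward_model(**chosen).logits.squeeze(-1)      # scalar reward per example
    r_rejected = reward_model(**rejected).logits.squeeze(-1)
    # -log sigmoid(r_chosen - r_rejected): pushes chosen rewards above rejected ones
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# One hypothetical training step on a single annotated preference pair
loss = preference_loss(
    ["Prompt: ... Answer: a helpful, accurate reply"],
    ["Prompt: ... Answer: an unhelpful or unsafe reply"],
)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

During RLHF proper, this reward model then scores the initial model's generations, and that scalar score is what the policy optimization step (e.g. PPO) maximizes.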
No matter how good the reward model is, its accuracy will always be lower than that of actual human feedback. Also, since the reward model needs to be comparable in size to the initial model to work effectively, running a good reward model is itself going to be very expensive.
Is there some kind of minimum baseline accuracy that a reward model should have for it to be considered useful? How quickly does the initial LLM degrade as the reward model's accuracy decreases?
Is it possible that collecting more human-generated feedback and feeding it directly to the initial model might actually be less expensive than building a reward model?