I think the reward model has to be pretrained with the set of {{completion 1, completion 2}, {human label 1, human label 2}}, right?
Hey @saileshbaidya,
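Yes. In reinforcement learning from human feedback, the reward model is usually trained first, on a dataset of completions and their corresponding human labels (commonly a preference between two completions for the same prompt). The trained reward model then scores the policy's outputs during the reinforcement learning stage.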
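As a rough sketch of what that training can look like, here is a toy pairwise (Bradley-Terry style) preference loss in plain Python. Everything in it is an assumption for illustration: the two-number feature vectors stand in for whatever representation a real reward model would compute from a completion, the linear `reward` function stands in for a neural network, and the names `reward`, `pairwise_loss`, and `train` are made up here rather than taken from any library.

```python
import math

def reward(weights, features):
    # Hypothetical linear reward model: a real one would be a neural network.
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    # -log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin)).
    margin = reward(weights, chosen) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))

def train(pairs, dim, lr=0.1, epochs=200):
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. weights is
            # -(1 - sigmoid(margin)) * (chosen - rejected), so step the other way.
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights

# Toy dataset (made up): each entry is (features of the completion the human
# preferred, features of the completion the human rejected).
pairs = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.8, 0.1], [0.3, 0.7]),
    ([0.9, 0.4], [0.2, 0.8]),
]
weights = train(pairs, dim=2)
```

After training, `reward(weights, chosen)` should come out higher than `reward(weights, rejected)` for each pair, which is the signal the RL stage then optimizes against.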
Cheers!
Jamal
Awesome, thanks!