Hi there,
I’m a little bit confused about the input to the reward model in RLHF. According to the course video, this model was trained on “two” prompt-completion pairs (one preferred, one not preferred).
So does the reward model take different kinds of pairs as input? Or does it behave like a regression model that outputs a reward? I probably misunderstood something, any response would be helpful, thanks for your time!
I am also very confused about this… how do we get a separate reward for each completion if the reward model (for instance BERT) is supposed to receive both of them at the same time?
Thank you for the response! But what I’m confused about is: if we train the model on two prompt-completion pairs at a time, doesn’t it also need the same “two” pairs at inference? Or does the BERT (reward) model simply output one reward per input pair, without any adjustment?
We can see that, during training, the reward model uses the two prompt-completion pairs to calculate the logits r_j and r_k (it runs the same architecture twice), and then we calculate the loss -log(σ(r_j - r_k)).
And during inference, that trained architecture is applied to a single prompt-completion pair to calculate its reward.
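To make that concrete, here is a minimal PyTorch sketch under the assumption of a BERT-style encoder with a scalar value head; the names (RewardModel, encode, value_head) and the example strings are just illustrative, not the course's actual code. The key point is that the same model is called twice during training to get r_j and r_k, and once at inference to score a single pair.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """BERT-style encoder with a scalar head: one prompt-completion pair in, one reward out."""
    def __init__(self, base_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_name)
        self.value_head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token representation as a summary of the whole pair
        return self.value_head(out.last_hidden_state[:, 0]).squeeze(-1)  # shape: (batch,)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = RewardModel()

def encode(prompt, completion):
    # Tokenize prompt and completion together as a single sequence pair
    return tokenizer(prompt, completion, return_tensors="pt", truncation=True, padding=True)

# ---- Training step: the SAME model is called twice, once per completion ----
prompt = "Explain RLHF briefly."
chosen = encode(prompt, "RLHF fine-tunes a model using human preference feedback.")
rejected = encode(prompt, "RLHF is a type of database index.")

r_j = model(**chosen)    # reward logit for the preferred completion
r_k = model(**rejected)  # reward logit for the rejected completion
loss = -torch.log(torch.sigmoid(r_j - r_k)).mean()  # pairwise loss: -log(sigma(r_j - r_k))
loss.backward()

# ---- Inference: a SINGLE prompt-completion pair in, a single scalar reward out ----
with torch.no_grad():
    reward = model(**encode(prompt, "Some new completion to score."))
print(reward.item())
```

So at inference time there is no pair at all: you feed one prompt-completion sequence and read off one scalar, which is exactly the reward used later in the PPO step.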