I’m a little confused about the input to the reward model in RLHF. According to the course video, this model is trained on “two” prompt-completion pairs (preferred and not preferred),
but at inference time the input seems to be only “one” prompt-completion pair.
So does the reward model take different kinds of pairs as input? Or does it behave like a regression model that predicts a reward? I have probably misunderstood something. Any response would be helpful, thanks for your time!
OK, it’s been some time since I did this course, but here is what I understand:
During training, you train the reward model by providing both positive and negative samples.
During inference, you are not training the reward model, just using what it already learned to get a score.
I am also very confused about this… How is it that we get a reward for each prompt-completion pair if the reward model (for instance, BERT) is supposed to receive both pairs at the same time?
Thank you for the response! But what I’m confused about is this: if we train the model on two prompt-completion pairs, doesn’t it also need the same “two” pairs at inference? Or does the BERT (reward) model just output as many rewards as there are input pairs, without any adjustment?
It’s trained with two pairs at a time during training, and during inference it just outputs a value based on that training!
From the paper mentioned in the slides, “Learning to summarize from human feedback”, Figure 2 (page 4):
we can see that during training the reward model takes the two prompt-completion pairs and calculates the logits
r_j and r_k (it uses the same architecture twice), and then we calculate the loss
log(σ(r_j - r_k))
During inference, that same trained architecture is used to calculate the reward for a single prompt-completion pair.
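To make the "same architecture twice" point concrete, here is a minimal sketch in plain Python. It is not the paper's or the course's implementation: the linear scorer, the two-number feature vectors (stand-ins for whatever an encoder such as BERT would produce for a prompt-completion pair), and the names `reward`, `pairwise_loss`, and `train_step` are all assumptions for illustration. The loss is written with the minus sign so that minimizing it pushes r_j above r_k.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(weights, features):
    """Score ONE prompt-completion pair and return a single scalar.
    `features` is a hypothetical stand-in for an encoder's output."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, preferred, rejected):
    """Training-time loss: the SAME scorer is called twice with shared weights."""
    r_j = reward(weights, preferred)  # reward of the preferred completion
    r_k = reward(weights, rejected)   # reward of the rejected completion
    return -math.log(sigmoid(r_j - r_k))

def train_step(weights, preferred, rejected, lr=0.1):
    """One gradient-descent step on the pairwise loss for a linear scorer."""
    diff = reward(weights, preferred) - reward(weights, rejected)
    # d(loss)/d(w) = -(1 - sigmoid(diff)) * (x_preferred - x_rejected)
    scale = 1.0 - sigmoid(diff)
    return [w + lr * scale * (xj - xk)
            for w, xj, xk in zip(weights, preferred, rejected)]

# Toy data (made up): one preferred and one rejected completion for a prompt.
preferred = [1.0, 0.2]
rejected = [0.1, 0.9]

weights = [0.0, 0.0]
for _ in range(50):
    weights = train_step(weights, preferred, rejected)

# Inference: a single pair goes in, a single scalar reward comes out.
score = reward(weights, [0.8, 0.3])
```

So the reward model itself always maps one pair to one number; the "two pairs" only show up in the loss, which compares two of those numbers.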
Thank you so much, appreciate it!