Question about reward model in RLHF

From the paper mentioned in the slides, "Learning to summarize from human feedback", Figure 2 (page 4):

we can see that, during training, the reward model takes the two prompt-completion pairs and calculates the scalar rewards r_j and r_k (the same architecture and weights are applied to each pair), and then the loss is computed as -log(σ(r_j - r_k)), where r_j is the reward of the human-preferred completion.
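
For concreteness, here is a minimal sketch of that pairwise loss in PyTorch, assuming a `reward_model` that maps a tokenized prompt-completion pair to one scalar per example (the names and shapes are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_inputs, rejected_inputs):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    `chosen_inputs` / `rejected_inputs` are assumed to be the tokenized
    prompt-completion pairs for the preferred and non-preferred summaries.
    The same reward model (same weights) scores both pairs.
    """
    r_chosen = reward_model(**chosen_inputs)      # shape: (batch,)
    r_rejected = reward_model(**rejected_inputs)  # shape: (batch,)
    # -log(σ(r_j - r_k)), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```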

And during inference, the trained reward model is used to calculate the reward for a single prompt-completion pair.
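
At inference time the same trained model would then just score one pair at a time, something like (again, the names here are illustrative):

```python
import torch

# Score a single tokenized prompt-completion pair with the trained reward model.
with torch.no_grad():
    reward = reward_model(**inputs)  # one scalar reward, used as the RL signal
```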
