From Figure 2 (page 4) of the paper mentioned in the slides, "Learning to summarize from human feedback", we can see that during training the reward model takes the two prompt-completion pairs and runs each through the same architecture (with shared weights) to produce the scalar rewards r_j and r_k, where j indexes the completion preferred by the human. The loss log(σ(r_j - r_k)) is then calculated from these two rewards.
During inference, that trained architecture is employed to calculate the reward for a single prompt-completion pair.
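A minimal sketch of that pairwise loss in plain NumPy (not the paper's code; note that the paper's Equation 1 minimizes the negative expectation of the expression shown in the figure, so the minimized per-pair loss is -log(σ(r_j - r_k))):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_reward_loss(r_j, r_k):
    """Per-pair reward-model loss.

    r_j: scalar reward of the human-preferred completion.
    r_k: scalar reward of the other completion.
    Returns -log(sigmoid(r_j - r_k)); minimizing it pushes r_j above r_k.
    """
    return -np.log(sigmoid(r_j - r_k))
```

For example, when the model already ranks the preferred completion higher (r_j = 2.0, r_k = 0.0) the loss is small, and when the two rewards are equal the loss is -log(0.5) ≈ 0.693, so gradient descent on this loss widens the reward gap in the human-preferred direction.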