As far as I know, the loss function of a reward model should be the negative log probability, not the positive log probability.

I think that is self-evident, right?

In the InstructGPT paper they wrote it correctly as a negative log-sigmoid:

`loss(θ) = -1/(K choose 2) * E_{(x, y_w, y_l) ~ D} [ log( σ( r_θ(x, y_w) - r_θ(x, y_l) ) ) ]`

where `y_w` is the preferred completion.

But in Stiennon et al. 2020 they wrote `loss = log(sigmoid(r_j - r_k))`, while still interpreting `j` as the summary preferred over `k`.

It’s obvious to experienced readers, but it’s still a mistake for sure.

Sorry, my fault. I should have added a question mark, because I only suspect it.

Intuitively, we want `r_j > r_k`, i.e., we want to maximize `log(sigmoid(r_j - r_k))`, the log probability that `j` is preferred. Since training minimizes the loss, the loss should be the *negative* of that quantity. That’s why I suspect the loss function should carry a minus sign.
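A minimal numeric sketch of the point above (the function name `pairwise_rm_loss` is my own, not from either paper): with the minus sign, the loss shrinks as the margin `r_j - r_k` grows, so minimizing it pushes the preferred reward `r_j` above `r_k`. Without the minus sign, minimizing would do the opposite.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_rm_loss(r_j, r_k):
    # Negative log probability that j beats k (Bradley-Terry style).
    # Minimizing this drives r_j above r_k, as intended.
    return -math.log(sigmoid(r_j - r_k))

# Larger margin r_j - r_k  =>  smaller loss, as a correct loss should behave:
assert pairwise_rm_loss(2.0, 0.0) < pairwise_rm_loss(0.0, 0.0) < pairwise_rm_loss(0.0, 2.0)

# Tie case: sigmoid(0) = 0.5, so the loss is log(2).
assert abs(pairwise_rm_loss(0.0, 0.0) - math.log(2.0)) < 1e-9
```

If the minus sign were dropped, as in the Stiennon et al. formula, the inequalities above would reverse and the optimizer would learn to rank the preferred summary *lower*.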

Besides, in Stiennon et al. 2020, Sec. 3.4, the authors wrote the loss function as `loss = log(sigmoid(r_j - r_k))`, with `j` the preferred summary.

However, Figure 2 of the same paper appears inconsistent with that. It really confuses me. Hopefully someone can answer my doubts.