Is there a typo in the loss function of the reward model in Week 3?

As far as I know, the loss function of a reward model should be the negative log probability, not the positive log probability.

I think that is self-evident, right?

In the InstructGPT paper they wrote it correctly as a negative log-sigmoid, loss(θ) = −E[(x, y_w, y_l) ∼ D] [log(σ(r_θ(x, y_w) − r_θ(x, y_l)))], where y_w is the preferred completion.

But in Stiennon et al. 2020 they wrote loss = log(σ(r_j − r_k)) while interpreting it as “j is better than k”.

Experienced readers will infer the intended sign, but it’s a mistake for sure.

Sorry, my fault! I should have added a question mark, because I only suspect it.
Intuitively, we want r_j > r_k, i.e., we want to maximize the log probability log(σ(r_j − r_k)) that j is preferred over k. Since training minimizes the loss, the loss should be the negative of that quantity. That’s why I suspect the sign.
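To make the sign concrete, here is a minimal pure-Python sketch (my own, not from the course code) of the pairwise loss with the minus sign, together with its gradient with respect to r_j:

```python
import math

def rm_loss(r_j: float, r_k: float) -> float:
    """Negative log-probability that j beats k: -log(sigmoid(r_j - r_k))."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_j - r_k))))

def grad_wrt_rj(r_j: float, r_k: float) -> float:
    """d(loss)/d(r_j) = -(1 - sigmoid(r_j - r_k)), which is always negative."""
    s = 1.0 / (1.0 + math.exp(-(r_j - r_k)))
    return -(1.0 - s)

# At r_j = r_k the gradient w.r.t. r_j is -0.5, so gradient descent
# raises r_j -- exactly what we want when j is the preferred completion.
print(grad_wrt_rj(0.0, 0.0))  # -0.5

# Sanity check: a larger margin r_j - r_k gives a smaller loss.
print(rm_loss(2.0, 0.0) < rm_loss(0.0, 0.0))  # True
```

With the un-negated version, loss = +log(σ(r_j − r_k)), the gradient sign flips and descent would push r_j below r_k, rewarding the dispreferred completion. That is why the minus sign matters.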
Besides, the loss function written in Section 3.4 of Stiennon et al. 2020 does not seem to match the loss = log(σ(r_j − r_k)) shown in Figure 2 of the same paper. It really confuses me. Hopefully someone can answer my doubts. :pray: