As far as I know, the loss function of a reward model should be the negative log probability, not the positive log probability.

I think that is self-evident, right?

In the InstructGPT paper they wrote it correctly as a negative log-sigmoid:

`loss(θ) = -1/(K choose 2) * E_{(x, y_w, y_l) ~ D} [ log( σ( r_θ(x, y_w) - r_θ(x, y_l) ) ) ]`

where `y_w` is the preferred completion.

But in Stiennon et al. 2020 they wrote `loss = log(sigmoid(r_j - r_k))`, while still interpreting `j` as the summary preferred over `k`.

It’s obvious to experienced readers, but it’s still a mistake for sure.

Sorry, my fault. I should have added a question mark, because I only suspect it.

Intuitively, we want `r_j > r_k`, i.e., we want to maximize `log(sigmoid(r_j - r_k))`, the log probability that `j` is preferred. Since training minimizes the loss, the loss should be the *negative* of that quantity. That’s why I suspect the loss function should carry a minus sign.
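A minimal numeric sketch of the point above (the function name `pairwise_rm_loss` is my own, not from either paper): with the minus sign, the loss shrinks as the margin `r_j - r_k` grows, so minimizing it pushes the preferred reward `r_j` above `r_k`. Without the minus sign, minimizing would do the opposite.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_rm_loss(r_j, r_k):
    # Negative log probability that j beats k (Bradley-Terry style).
    # Minimizing this drives r_j above r_k, as intended.
    return -math.log(sigmoid(r_j - r_k))

# Larger margin r_j - r_k  =>  smaller loss, as a correct loss should behave:
assert pairwise_rm_loss(2.0, 0.0) < pairwise_rm_loss(0.0, 0.0) < pairwise_rm_loss(0.0, 2.0)

# Tie case: sigmoid(0) = 0.5, so the loss is log(2).
assert abs(pairwise_rm_loss(0.0, 0.0) - math.log(2.0)) < 1e-9
```

If the minus sign were dropped, as in the Stiennon et al. formula, the inequalities above would reverse and the optimizer would learn to rank the preferred summary *lower*.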

Besides, in Stiennon et al. 2020, Sec. 3.4, the authors wrote the loss function as `loss = log(sigmoid(r_j - r_k))`, with `j` the preferred summary.

However, Figure 2 of the same paper appears inconsistent with that. It really confuses me. Hopefully someone can answer my doubts.