I’m struggling to understand why this loss function is used to train the reward model. I’m referring to the “RLHF: Reward model” video at min 1:00.

Given that we want to train the model to favor completion y_j over y_k, my understanding is that we want to maximize the reward difference r_j - r_k, i.e. get σ(r_j - r_k) as close to 1 as possible. But log(σ(r_j - r_k)) ranges from -∞ to 0, so minimizing it seems to yield the opposite of what we want.

I would understand if we were, say, taking log(σ(r_k - r_j)), or taking -log(σ(r_j - r_k)).

Yes, I looked at the paper, but I’m not familiar with the $E_{x\sim D}[f]$ notation. Could you explain it to me? Especially what E and D are, and what the relation is between E and what’s inside the brackets?

The loss has to be computed over all samples in the dataset. $E$ denotes an expectation, and $D$ is the dataset the samples (the summary pairs) are drawn from; the expression inside the brackets is the quantity being averaged over those samples. So in practice the loss is just the average, over the dataset $D$, of the loss computed for each sample. You can therefore move the negative sign inside $E[\cdot]$, which results in $-\log\sigma(r_j - r_k)$ for each individual sample — exactly the form you said you would understand.
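To make the sign concrete, here is a minimal sketch (plain Python, my own illustrative function names) of the per-sample loss $-\log\sigma(r_j - r_k)$ averaged over a batch. It shows that minimizing this quantity pushes $r_j$ above $r_k$, which is what we want:

```python
import math

def sigmoid(x):
    # Standard logistic function sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(pairs):
    """Average of -log(sigma(r_j - r_k)) over a batch of reward pairs.

    pairs: list of (r_j, r_k) scalar rewards, where y_j is the completion
    the labeler preferred. This is the negative sign moved inside E[.].
    """
    return -sum(math.log(sigmoid(r_j - r_k)) for r_j, r_k in pairs) / len(pairs)

# Preferred completion scored higher -> loss is small (near 0):
print(reward_model_loss([(3.0, 0.0)]))
# Preferred completion scored lower -> loss is large:
print(reward_model_loss([(0.0, 3.0)]))
```

Note the loss is always positive and approaches 0 only as σ(r_j - r_k) → 1, so minimizing it and maximizing σ(r_j - r_k) are the same thing.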

To write equations, just use TeX code inside dollar signs, like $E_{x\sim D}$ — it will then be displayed as rendered math.