I’m struggling to understand why this loss function is used to train the reward model. I’m referring to the “RLHF: Reward model” video at min 1:00
Given that we want to train the model to favor completion y_j over y_k, my understanding is that we want to maximize r_j - r_k, i.e. get σ(r_j - r_k) as close to 1 as possible. But log(σ(r_j - r_k)) ranges from -∞ to 0, so minimizing it seems to me like it yields the opposite of what we want.
I would understand if we were, say, taking log(σ(r_k - r_j)), or taking -log(σ(r_j - r_k)).
Yes, I looked at the paper, but I’m not familiar with the $E_{x\sim D}[f]$ notation. Could you explain it to me? In particular, what are E and D, and what is the relation between E and what’s inside the brackets?
The loss has to be computed over all samples in the dataset. In practice, it is just an expectation (i.e. an average) of the per-sample loss over the dataset D of summary pairs. You can therefore move the negative sign inside the $E[\cdot]$, which gives $-\log(\cdot)$ for each individual sample: $-E_{x\sim D}[\log\sigma(r_j - r_k)] = E_{x\sim D}[-\log\sigma(r_j - r_k)]$.
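If it helps, here is a minimal PyTorch sketch of that idea. The reward values are made up; in practice r_j and r_k would come from the reward model scoring each summary pair in D.

```python
import torch
import torch.nn.functional as F

# Made-up rewards: r_j[i] is the reward for the preferred summary of pair i,
# r_k[i] the reward for the rejected one.
r_j = torch.tensor([1.2, 0.3, -0.5])
r_k = torch.tensor([0.4, 0.9, -1.1])

# Per-pair loss: -log(sigmoid(r_j - r_k)); F.logsigmoid is the numerically
# stable way to compute log(sigmoid(.)).
per_pair_loss = -F.logsigmoid(r_j - r_k)

# The expectation E_{x ~ D}[.] is approximated by the mean over the sampled pairs.
loss = per_pair_loss.mean()
print(loss)  # a single scalar to minimize
```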
To write equations, just put TeX code inside dollar signs, like $E_{x\sim D}$; it will then be rendered as proper math.
That’s a good question. Starting with max(r_j - r_k): it is not used because it lacks gradation, as it only cares about whether r_j > r_k, not by how much, so it doesn’t provide nuanced feedback for learning. MSE isn’t ideal either, for a couple of reasons:
Scale sensitivity: MSE is sensitive to the scale of the rewards, which can be arbitrary in preference learning.
Outlier sensitivity: MSE heavily penalizes large errors, which might not be desirable in preference learning.

In short, in RLHF we’re usually more interested in learning relative preferences than absolute reward values. The log-sigmoid loss captures this idea well and provides a balanced approach to learning from pairwise preferences.
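To make the relative-vs-absolute point concrete, here is a small sketch with made-up numbers (the MSE target is purely illustrative): the log-sigmoid loss depends only on the difference r_j - r_k, so shifting both rewards by a constant changes nothing, while an MSE on the raw rewards blows up from the shift alone.

```python
import torch
import torch.nn.functional as F

# Two reward pairs with the same difference (0.5), but absolute values
# shifted by a constant of 100.
r_j, r_k = torch.tensor(1.0), torch.tensor(0.5)
r_j_shifted, r_k_shifted = r_j + 100.0, r_k + 100.0

# The log-sigmoid loss only sees the difference, so the shift changes nothing.
print(-F.logsigmoid(r_j - r_k))                  # same value...
print(-F.logsigmoid(r_j_shifted - r_k_shifted))  # ...as this

# An MSE-style loss against some absolute target reward depends on the
# scale/offset, which is arbitrary in pairwise preference data.
target = torch.tensor(1.0)
print(F.mse_loss(r_j, target))          # small
print(F.mse_loss(r_j_shifted, target))  # huge, purely because of the offset
```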
Hey Malhar, thanks for bringing up this correction. What I meant here was differentiability: max does give the maximum value, but it is not differentiable, which can cause issues for gradient-based optimization methods.
Thank you for asking this! I was confused by this too: once the missing negation is added, it makes much more sense.
I wondered the same thing. I asked ChatGPT, so take this with a pinch of salt, but the intuition seemed sound to me:
Minimising -log(sigmoid(r_j - r_k)) is good for a task where we want our model to learn a preference between two options, because once r_j is higher than r_k, the gradient of the function tapers off towards zero. So the model learns to reward j more than k for most examples, rather than learning to excessively reward j without limit on some examples.
My understanding is that this means the model is able to converge on a minimal loss value once it’s as good as it can get at rewarding j more than k.
Conversely, if we were maximising r_j - r_k directly, the model would just keep increasing the reward for j.
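For anyone curious, here is a tiny PyTorch check of that tapering-gradient intuition (delta stands for r_j - r_k; the values are arbitrary). The gradient of -log(sigmoid(delta)) shrinks towards zero as delta grows, while the gradient of simply maximising the margin, i.e. minimising -(r_j - r_k), stays constant, so nothing ever stops the model from inflating r_j.

```python
import torch
import torch.nn.functional as F

# delta plays the role of the margin r_j - r_k for a single preference pair.
for delta in [-2.0, 0.0, 2.0, 10.0]:
    # Loss actually used: -log(sigmoid(delta)). Its gradient flattens out
    # once delta is comfortably positive.
    d1 = torch.tensor(delta, requires_grad=True)
    (-F.logsigmoid(d1)).backward()

    # Naively maximising the margin, i.e. minimising -delta: the gradient
    # is always -1, so the incentive to push r_j up never goes away.
    d2 = torch.tensor(delta, requires_grad=True)
    (-d2).backward()

    print(f"delta={delta:+5.1f}  "
          f"grad(-logsigmoid)={d1.grad.item():+.4f}  "
          f"grad(-delta)={d2.grad.item():+.1f}")
```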