Why Log Sigmoid log(σ(r_j - r_k)) as loss function to train reward model?

Hello Gen AI community,

I’m struggling to understand why this loss function is used to train the reward model. I’m referring to the “RLHF: Reward model” video at minute 1:00.

Given we want to train the model to favor completion y_j over y_k, my understanding is that we want to maximize r_j - r_k, i.e., get σ(r_j - r_k) as close to 1 as possible. But log(σ(r_j - r_k)) ranges from -∞ to 0, so minimizing it seems to me to yield the opposite of what we want.

I would understand if we were, say, taking log(σ(r_k - r_j)) or taking -log(σ(r_j - r_k)).

Anyone able to clarify this for me?

Cedric

5 Likes

I think we can find support for one of these suggestions in the paper listed in the lower-left corner of those slides.

1 Like

That’s a good observation @cedricvidal. In the paper, the loss is defined as $-E_{x\sim D}[\log(\sigma(r_j - r_k))]$, where $x$ is a summary input.
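(For reference, my reading of the full pairwise objective in that paper, with the reward model written explicitly as $r_\theta(x, y)$; the paper's own indexing of the preferred and rejected summaries differs slightly: $\mathrm{loss}(r_\theta) = -E_{(x, y_j, y_k)\sim D}\left[\log\left(\sigma\left(r_\theta(x, y_j) - r_\theta(x, y_k)\right)\right)\right]$, where $y_j$ is the human-preferred summary.)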

1 Like

Yes, I looked at the paper, but I’m not familiar with the $E_{x\sim D}[f]$ notation. Could you explain it to me? Especially what $E$ and $D$ are, and what the relation is between $E$ and what’s inside the brackets?

Note: how do you write equations in a post here?

1 Like

The loss has to be computed over all samples in the dataset. In practice, it is just an expectation (average) of the loss computed for each sample (each pair of summaries) in the dataset $D$. You can therefore move the negative sign inside the $E[\cdot]$, which results in $-\log(\cdot)$ for each individual sample.
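As a minimal sketch of what that batch average looks like in code (assuming a PyTorch setup where scalar rewards have already been computed by some reward model; the function and variable names here are illustrative, not from the course):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_j: torch.Tensor, r_k: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: mean over the batch of -log(sigmoid(r_j - r_k)).

    r_j: rewards for the human-preferred completions, shape (batch,)
    r_k: rewards for the rejected completions, shape (batch,)
    """
    # F.logsigmoid is numerically stabler than torch.log(torch.sigmoid(...))
    return -F.logsigmoid(r_j - r_k).mean()

# Toy usage: the loss is small when r_j >> r_k and large when r_k >> r_j.
r_j = torch.tensor([2.0, 0.5, -1.0])
r_k = torch.tensor([0.0, 1.0, -3.0])
print(reward_model_loss(r_j, r_k))
```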

To write equations, just use TeX code inside dollar signs, like $E_{x\sim D}$; this will be displayed as E_{x\sim D}.

1 Like

Yes, I believe the course slide should be updated to indicate that the loss (being minimized) is the negative log sigmoid of $r_j - r_k$.

2 Likes

Agreed on this point.

I think the course material comes from Figure 2 of the paper.

And it does not seem to match the mathematical formula on page 6 of the paper, where there is a negative sign.

3 Likes

If anyone can help, I have a further question on this.

Why can’t the loss function simply be max(r_j - r_k), or maybe MSE?

Why do we need a sigmoid or a log applied to it?

That’s a good question. Starting with max(r_j - r_k): it isn’t used because it lacks gradation, since it only cares about whether r_j > r_k, not by how much, so it doesn’t provide nuanced feedback for learning. MSE isn’t ideal for a couple of reasons:
Scale sensitivity: it is sensitive to the scale of the rewards, which can be arbitrary in preference learning.
Outlier sensitivity: it heavily penalizes large errors, which might not be desirable in preference learning.
In short, in RLHF we’re usually more interested in learning relative preferences than absolute reward values. The log-sigmoid loss captures this well, providing a balanced approach to learning from pairwise preferences (see the sketch just below).
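To make the relative-vs-absolute point concrete, here is a minimal sketch (plain PyTorch; the reward and target values are made up for illustration) showing that the pairwise -log(sigmoid(r_j - r_k)) loss is unchanged when both rewards are shifted by the same constant, while an MSE against absolute target scores is not:

```python
import torch
import torch.nn.functional as F

def pairwise_loss(r_j, r_k):
    # -log(sigmoid(r_j - r_k)): depends only on the difference r_j - r_k
    return -F.logsigmoid(r_j - r_k).mean()

def mse_loss(r, target):
    # MSE against absolute target scores: depends on the scale/offset of r
    return ((r - target) ** 2).mean()

r_j = torch.tensor([1.0, 2.0])
r_k = torch.tensor([0.0, 0.5])
shift = 10.0  # add the same constant to every reward

print(pairwise_loss(r_j, r_k), pairwise_loss(r_j + shift, r_k + shift))  # identical
target = torch.tensor([1.0, 1.0])  # hypothetical absolute labels
print(mse_loss(r_j, target), mse_loss(r_j + shift, target))              # very different
```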

Hi Aditya, thank you for the reply. I don’t follow this bit:

Starting with max(r_j - r_k): it isn’t used because it lacks gradation, since it only cares about whether r_j > r_k, not by how much.

If there is a max operator, why would you say that it doesn’t care by how much r_j is greater than r_k?

Hey Malhar, thanks for raising this. What I meant was in terms of differentiation: max does return the larger value, but it is not differentiable everywhere, which can cause issues for gradient-based optimization methods.

1 Like

Ah I understand. Thank you for pointing that out.

Thank you for asking this! I was confused by it too; adding the missing negation makes much more sense.

I wondered the same thing. I asked ChatGPT, so take this with a pinch of salt, but the intuition seemed sound to me:

Minimising -log(sigmoid(r_j - r_k)) is a good fit for a task where we want the model to learn a preference between two options, because once r_j is higher than r_k, the gradient of the function tapers off towards zero. So the model learns to reward j more than k across most examples, rather than learning to reward j excessively, without limit, on some examples.

My understanding is that this means the model is able to converge on a minimal loss value once it’s as good as it can get at rewarding j more than k.

Conversely, if we were maximising r_j - r_k, the model would just keep increasing the reward for j.

This is my understanding at least.
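A quick numerical sketch of that tapering (assuming the same PyTorch-style setup as in the earlier snippets; the margin values are arbitrary): the gradient of -log(sigmoid(Δ)) with respect to the margin Δ = r_j - r_k is sigmoid(Δ) - 1, which approaches 0 as Δ grows, whereas directly maximising Δ would give a constant gradient that never lets up.

```python
import torch
import torch.nn.functional as F

# Margins delta = r_j - r_k, from "wrong order" to "clearly correct order".
for delta in [-2.0, 0.0, 2.0, 5.0, 10.0]:
    d = torch.tensor(delta, requires_grad=True)
    loss = -F.logsigmoid(d)   # pairwise preference loss for this margin
    loss.backward()
    # d.grad == sigmoid(delta) - 1: strongly negative when the order is wrong,
    # near zero once r_j comfortably exceeds r_k, so the pressure tapers off.
    print(f"delta={delta:+.1f}  loss={loss.item():.4f}  dloss/ddelta={d.grad.item():+.4f}")
```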

1 Like