Why Log Sigmoid log(σ(r_j - r_k)) as loss function to train reward model?

Hello Gen AI community,

I’m struggling to understand why this loss function is used to train the reward model. I’m referring to the “RLHF: Reward model” video at minute 1:00.

Given we want to train the model to favor completion y_j over y_k, my understanding is that we want to maximize r_j - r_k, i.e., get σ(r_j - r_k) as close to 1 as possible. But log(σ(r_j - r_k)) ranges from -∞ to 0, so minimizing it seems to me to yield the opposite of what we want.

I would understand if we were, say, taking log(σ(r_k - r_j)) or taking -log(σ(r_j - r_k)).

Anyone able to clarify this for me?

Cedric

5 Likes

I think we can find support for one of these suggestions in the paper listed in the lower-left corner of those slides.

1 Like

That’s a good observation @cedricvidal. In the paper, the loss is defined as $-E_{x\sim D}[\log(\sigma(r_j - r_k))]$, where $x$ is a summary input.
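(For reference, my reading of the full pairwise objective in that paper, with the reward model written explicitly as $r_\theta(x, y)$; the paper's own indexing of the preferred and rejected summaries differs slightly: $\mathrm{loss}(r_\theta) = -E_{(x, y_j, y_k)\sim D}\left[\log\left(\sigma\left(r_\theta(x, y_j) - r_\theta(x, y_k)\right)\right)\right]$, where $y_j$ is the human-preferred summary.)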

1 Like

Yes, I looked at the paper, but I’m not familiar with the $E_{x\sim D}[f]$ notation. Could you explain it to me? Especially what $E$ and $D$ are, and what the relation is between $E$ and what’s inside the brackets?

Note: how do you write equations in a post here?

1 Like

The loss has to be computed over all samples in the dataset. In practice, it is just an expectation (average) of the loss computed for each sample (each pair of summaries) in the dataset $D$. You can therefore move the negative sign inside the $E[\cdot]$, which results in $-\log(\cdot)$ for each individual sample.
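As a minimal sketch of what that batch average looks like in code (assuming a PyTorch setup where scalar rewards have already been computed by some reward model; the function and variable names here are illustrative, not from the course):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_j: torch.Tensor, r_k: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: mean over the batch of -log(sigmoid(r_j - r_k)).

    r_j: rewards for the human-preferred completions, shape (batch,)
    r_k: rewards for the rejected completions, shape (batch,)
    """
    # F.logsigmoid is numerically stabler than torch.log(torch.sigmoid(...))
    return -F.logsigmoid(r_j - r_k).mean()

# Toy usage: the loss is small when r_j >> r_k and large when r_k >> r_j.
r_j = torch.tensor([2.0, 0.5, -1.0])
r_k = torch.tensor([0.0, 1.0, -3.0])
print(reward_model_loss(r_j, r_k))
```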

To write equations, just use TeX code inside dollar signs, like $E_{x\sim D}$; this will be displayed as E_{x\sim D}.

1 Like

Yes, I believe the course slide should be updated to indicate that the loss (being minimized) is the negative log sigmoid of $r_j - r_k$.

2 Likes

Agreed on this point.

I think the course material comes from Figure 2 of the paper.

And it does not seem to match the mathematical formula on page 6 of the paper, where there is a negative sign.

3 Likes

If anyone can help, I have a further question on this.

Why can’t the loss function simply be max(r_j - r_k), or maybe MSE?

Why do we need a sigmoid or a log applied to it?

That’s a good question. Starting with max(r_j - r_k): it isn’t used because it lacks gradation, since it only cares about whether r_j > r_k, not by how much, so it doesn’t provide nuanced feedback for learning. MSE isn’t ideal for a couple of reasons:
Scale sensitivity: it is sensitive to the scale of the rewards, which can be arbitrary in preference learning.
Outlier sensitivity: it heavily penalizes large errors, which might not be desirable in preference learning.
In short, in RLHF we’re usually more interested in learning relative preferences than absolute reward values. The log-sigmoid loss captures this well, providing a balanced approach to learning from pairwise preferences (see the sketch just below).
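To make the relative-vs-absolute point concrete, here is a minimal sketch (plain PyTorch; the reward and target values are made up for illustration) showing that the pairwise -log(sigmoid(r_j - r_k)) loss is unchanged when both rewards are shifted by the same constant, while an MSE against absolute target scores is not:

```python
import torch
import torch.nn.functional as F

def pairwise_loss(r_j, r_k):
    # -log(sigmoid(r_j - r_k)): depends only on the difference r_j - r_k
    return -F.logsigmoid(r_j - r_k).mean()

def mse_loss(r, target):
    # MSE against absolute target scores: depends on the scale/offset of r
    return ((r - target) ** 2).mean()

r_j = torch.tensor([1.0, 2.0])
r_k = torch.tensor([0.0, 0.5])
shift = 10.0  # add the same constant to every reward

print(pairwise_loss(r_j, r_k), pairwise_loss(r_j + shift, r_k + shift))  # identical
target = torch.tensor([1.0, 1.0])  # hypothetical absolute labels
print(mse_loss(r_j, target), mse_loss(r_j + shift, target))              # very different
```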

Hi Aditya, thank you for the reply. I don’t follow this bit:

Starting with max(r_j - r_k): it isn’t used because it lacks gradation, since it only cares about whether r_j > r_k, not by how much.

If there is a max operator, why would you say that it doesn’t care by how much r_j is greater than r_k?

Hey Malhar, thanks for raising this. What I meant was in terms of differentiation: max does return the larger value, but it is not differentiable everywhere, which can cause issues for gradient-based optimization methods.

1 Like

Ah I understand. Thank you for pointing that out.

Thank you for asking this! I was confused by it too; adding the missing negation makes much more sense.

I wondered the same thing. I asked ChatGPT, so take this with a pinch of salt, but the intuition seemed sound to me:

Minimising -log(sigmoid(r_j - r_k)) is a good fit for a task where we want the model to learn a preference between two options, because once r_j is higher than r_k, the gradient of the function tapers off towards zero. So the model learns to reward j more than k across most examples, rather than learning to reward j excessively, without limit, on some examples.

My understanding is that this means the model is able to converge on a minimal loss value once it’s as good as it can get at rewarding j more than k.

Conversely, if we were maximising r_j - r_k, the model would just keep increasing the reward for j.

This is my understanding at least.
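A quick numerical sketch of that tapering (assuming the same PyTorch-style setup as in the earlier snippets; the margin values are arbitrary): the gradient of -log(sigmoid(Δ)) with respect to the margin Δ = r_j - r_k is sigmoid(Δ) - 1, which approaches 0 as Δ grows, whereas directly maximising Δ would give a constant gradient that never lets up.

```python
import torch
import torch.nn.functional as F

# Margins delta = r_j - r_k, from "wrong order" to "clearly correct order".
for delta in [-2.0, 0.0, 2.0, 5.0, 10.0]:
    d = torch.tensor(delta, requires_grad=True)
    loss = -F.logsigmoid(d)   # pairwise preference loss for this margin
    loss.backward()
    # d.grad == sigmoid(delta) - 1: strongly negative when the order is wrong,
    # near zero once r_j comfortably exceeds r_k, so the pressure tapers off.
    print(f"delta={delta:+.1f}  loss={loss.item():.4f}  dloss/ddelta={d.grad.item():+.4f}")
```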

1 Like