Question about reward model in RLHF

Chun_Yang · December 30, 2023, 3:11am

Hi there,
I’m a little confused about the input to the reward model in RLHF. According to the course video, this model is trained on “two” prompt-completion pairs (one preferred, one not preferred),

but at inference, the input seems to be only “one” prompt-completion pair.

So does the reward model take different kinds of input in training and inference? Or does it behave like a regression model that predicts a reward? I have probably misunderstood something, so any response would be helpful. Thanks for your time!

gent.spah · December 30, 2023, 11:37am

OK, it's been some time since I did this course, but here is what I understand:

During training, you train the reward model by providing both positive and negative samples.

During inference, you are not training the reward model; you are just using it to get a score based on the training it has already done.

Zildjian240 · December 30, 2023, 7:06pm

I am also very confused about this… how is it that we get a reward for both prompts if the reward model (for instance BERT) is supposed to get both prompts at the same time?

Chun_Yang · December 31, 2023, 5:55am

Thank you for the response! But what I’m confused about is this: if we train the model on two prompt-completion pairs, doesn’t it also need “two” pairs at inference? Or does the BERT (reward) model just output as many rewards as it receives input pairs, without any adjustment?

gent.spah · December 31, 2023, 12:01pm

It's trained with 2 pairs during training, and during inference it just outputs a value based on that training!

aym3j · January 6, 2024, 7:38pm

From the paper mentioned in the slides, “Learning to summarize from human feedback”, Figure 2 (page 4):

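We can see that during training, the reward model uses the two prompt-completion pairs to calculate the logits r_j and r_k (it runs the same architecture twice, once per pair), and then we calculate the loss -log(σ(r_j - r_k)), where r_j is the reward of the preferred completion and the minus sign makes it a quantity to minimize.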

And during inference, that same trained architecture is run once, on a single prompt-completion pair, to calculate the reward.
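
To make the "same architecture twice" point concrete, here is a minimal sketch in plain Python. It is not the paper's or the course's implementation: the reward model is reduced to a linear function over a made-up two-number feature vector (hypothetical "helpfulness" and "toxicity" scores standing in for an encoded prompt-completion pair), and all function names are my own. What it shows is that one set of weights scores each pair separately, the pairwise loss -log(σ(r_j - r_k)) only combines the two scalar scores, and inference needs just one pair.

```python
import math

def reward(weights, features):
    """Scalar reward for ONE prompt-completion pair (toy feature vector)."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, preferred, rejected):
    """-log(sigmoid(r_j - r_k)); the same weights score both completions."""
    diff = reward(weights, preferred) - reward(weights, rejected)
    return math.log1p(math.exp(-diff))  # identical to -log(sigmoid(diff))

def train(pairs, dim, lr=0.1, epochs=200):
    """Plain SGD on the pairwise loss, starting from zero weights."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            diff = reward(weights, preferred) - reward(weights, rejected)
            # Derivative of -log(sigmoid(diff)) with respect to diff.
            grad_scale = -1.0 / (1.0 + math.exp(diff))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights

# Hypothetical features: [helpfulness, toxicity]. Each item is
# (preferred completion, rejected completion) for the same prompt.
pairs = [([1.0, 0.0], [0.0, 1.0]),
         ([0.8, 0.1], [0.3, 0.9])]

weights = train(pairs, dim=2)

# Inference: a single pair goes in, a single scalar reward comes out.
score = reward(weights, [0.9, 0.05])
```

So the model itself always maps one pair to one number; the "two pairs" only appear in the training loss, which compares two such numbers.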

Chun_Yang · January 7, 2024, 10:49am

Thank you so much, appreciate it!

aym3j · January 7, 2024, 8:10pm

You’re welcome
