From Figure 2 (page 4) of the paper mentioned in the slides, "Learning to summarize from human feedback", we can see that during training the reward model takes the two prompt-completion pairs and runs each through the same architecture (with shared weights) to produce the scalar rewards r_j and r_k, where j indexes the completion preferred by the human. The loss log(σ(r_j - r_k)) is then calculated from these two rewards.
During inference, that trained architecture is employed to calculate the reward for a single prompt-completion pair.
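A minimal sketch of that pairwise loss in plain NumPy (not the paper's code; note that the paper's Equation 1 minimizes the negative expectation of the expression shown in the figure, so the minimized per-pair loss is -log(σ(r_j - r_k))):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_reward_loss(r_j, r_k):
    """Per-pair reward-model loss.

    r_j: scalar reward of the human-preferred completion.
    r_k: scalar reward of the other completion.
    Returns -log(sigmoid(r_j - r_k)); minimizing it pushes r_j above r_k.
    """
    return -np.log(sigmoid(r_j - r_k))
```

For example, when the model already ranks the preferred completion higher (r_j = 2.0, r_k = 0.0) the loss is small, and when the two rewards are equal the loss is -log(0.5) ≈ 0.693, so gradient descent on this loss widens the reward gap in the human-preferred direction.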