Clarification on Optional video: Proximal policy optimization

The MDP formulation introduced in this video seems a bit different from my understanding of the InstructGPT paper. In that paper, I believe the reward model never evaluates the quality of an incomplete response: a reward is assigned only once the full response has been generated. This reading also matches how the reward model is trained, i.e., it is trained to score complete responses, not partial ones, so every intermediate step would effectively receive zero reward.
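
To make that sparse-reward reading concrete, here is a minimal Python sketch. Everything in it is a hypothetical stand-in for illustration (`DummyPolicy`, `DummyRewardModel`, the token strings); it is not the actual InstructGPT or video implementation, just the structure I have in mind.

```python
import random

class DummyPolicy:
    """Hypothetical stand-in for the LM policy."""
    def sample_next(self, tokens):
        # Placeholder sampling; a real policy would condition on `tokens`.
        return random.choice(["hello", "world", "<eos>"])

class DummyRewardModel:
    """Hypothetical stand-in for the trained reward model.

    It is only ever called on a *complete* response, mirroring how the
    reward model is trained in this reading of the paper.
    """
    def score(self, prompt, response):
        return float(len(response))  # placeholder scalar reward

def rollout_rewards(prompt, policy, reward_model, eos_token="<eos>", max_len=8):
    """Roll out a response token by token under the sparse-reward view:
    every intermediate step gets reward 0, and only the terminal step
    (when the response is complete) is scored by the reward model."""
    response, rewards = [], []
    for step in range(max_len):
        token = policy.sample_next(prompt + response)
        response.append(token)
        done = token == eos_token or step == max_len - 1
        if done:
            # Terminal step: reward model scores the complete response.
            rewards.append(reward_model.score(prompt, response))
            break
        # Incomplete response: no reward signal at this step.
        rewards.append(0.0)
    return response, rewards

response, rewards = rollout_rewards(["Prompt:"], DummyPolicy(), DummyRewardModel())
print(response, rewards)  # e.g. ['hello', '<eos>'] [0.0, 2.0]
```

Under this view, the per-token reward sequence is all zeros except at the end, which is what I understood to differ from the per-step formulation in the video.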