The new summaries look wonky. It looks to me like the KL divergence penalty did not do a good job of preventing reward hacking. I’ve highlighted two suspected cases of reward hacking in the first few rows. Similar to this thread: Toxicity mean value increased after detoxification - #5 by gaspardbos. My guess is that it’s due to not enough epochs. Does anyone have any other ideas?
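For context, here is a minimal sketch of how the KL penalty usually enters the PPO reward in this kind of setup (the function name and tensors are illustrative, not TRL's actual internals; `kl_coef` plays the role of `init_kl_coef` in TRL's `PPOConfig`):

```python
import torch

def kl_penalized_rewards(reward, logprobs_policy, logprobs_ref, kl_coef=0.2):
    """Per-token reward used by PPO: the scalar reward from the reward
    model is applied at the last token, and every token pays a penalty
    proportional to the log-ratio between the tuned policy and the
    frozen reference model. If kl_coef is too small, the policy can
    drift far from the reference and reward-hack."""
    # Per-token estimator of KL(policy || reference)
    kl = logprobs_policy - logprobs_ref
    per_token = -kl_coef * kl            # KL penalty on every token
    per_token[:, -1] += reward           # environment reward on the final token
    return per_token

# Toy example: batch of 1, sequence of 4 tokens, scalar reward 1.0
lp_pol = torch.tensor([[-1.0, -0.5, -0.8, -0.3]])
lp_ref = torch.tensor([[-1.2, -0.6, -0.7, -0.9]])
print(kl_penalized_rewards(torch.tensor([1.0]), lp_pol, lp_ref))
```

If the runs were done with TRL, it might be worth raising `init_kl_coef` (or lowering `target` with adaptive KL control enabled) before adding epochs, since a weak penalty is the more direct cause of this failure mode.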
Yes, that sure looks like reward hacking.
The problem here may be that we aren’t dealing with toxic data to begin with. There’s only so much AI can do…
