Lab 3: Qualitative evaluation of the PPO model; wonky results

The new summaries look wonky. It looks to me like the KL divergence penalty did not do a good job of preventing reward hacking. I’ve highlighted two suspected instances of reward hacking in the first few rows. This seems similar to this thread: Toxicity mean value increased after detoxification - #5 by gaspardbos. My guess is that it’s because of not enough training epochs. Does anyone have any other ideas?
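For reference, here is a minimal sketch of the PPOConfig knobs I would try tweaking, assuming the lab's TRL PPOTrainer setup. The model name and the specific values below are placeholders, not the lab's defaults; the idea is just to strengthen the KL penalty and run more optimization epochs per PPO step.

```python
# Sketch only: tighten the KL penalty and increase PPO epochs in TRL's PPOConfig.
# Values are illustrative; substitute whatever the lab notebook actually uses.
from trl import PPOConfig

config = PPOConfig(
    model_name="google/flan-t5-base",  # assumption: the lab's base model
    learning_rate=1.41e-5,
    batch_size=16,
    mini_batch_size=4,
    ppo_epochs=4,        # more optimization epochs per batch of rollouts
    init_kl_coef=0.4,    # larger initial KL coefficient = stronger pull toward the reference model
    adap_kl_ctrl=True,   # let TRL adapt the coefficient toward `target`
    target=6.0,          # target KL; lowering it penalizes drift from the reference model more aggressively
)
```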

Yes, that sure looks like reward hacking 🙂
The problem here may be that we aren’t dealing with toxic data to begin with. There’s only so much AI can do…
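One quick sanity check is to score a sample of the raw inputs with the same toxicity measurement the evaluation uses and see how toxic they are to begin with. A rough sketch, where the dataset name and split are my guess at what the lab loads:

```python
# Sketch: measure baseline toxicity of the raw inputs before any detoxification.
import evaluate
import numpy as np
from datasets import load_dataset

toxicity = evaluate.load("toxicity", module_type="measurement")

# Assumption: the lab uses the DialogSum dataset; swap in the actual dataset/field.
sample = load_dataset("knkarthick/dialogsum", split="train[:200]")
scores = toxicity.compute(predictions=sample["dialogue"])["toxicity"]

print(f"mean toxicity of raw inputs: {np.mean(scores):.4f}")
print(f"share of inputs above 0.5:   {np.mean(np.array(scores) > 0.5):.2%}")
```

If the mean is already near zero, there isn't much headroom for the reward model to improve on, and small fluctuations can look like the metric got worse.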