The new summaries look wonky. It looks to me like the KL-divergence penalty did not do a good job of preventing reward hacking. I've highlighted two suspected cases of reward hacking in the first few rows. Similar to this thread: Toxicity mean value increased after detoxification - #5 by gaspardbos. My guess is that it's due to too few epochs. Does anyone have any other ideas about this?
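For context, here's a minimal sketch of how I understand the KL penalty to enter the reward in PPO-style RLHF (the `kl_penalized_reward` helper and the `kl_coef=0.2` default are my own illustration, not code from the tutorial): if the coefficient is too small, the policy can drift far from the reference model and game the reward model, which would explain wonky outputs independent of epoch count.

```python
import torch

def kl_penalized_reward(reward, logprobs, ref_logprobs, kl_coef=0.2):
    """Illustrative sketch of the KL-shaped reward in PPO-style RLHF.

    reward:       scalar score from the (toxicity) reward model
    logprobs:     per-token log-probs of the generated tokens under the policy
    ref_logprobs: per-token log-probs of the same tokens under the frozen
                  reference model
    kl_coef:      penalty strength (larger = policy stays closer to reference)
    """
    # Monte Carlo estimate of KL(policy || reference) over the sampled tokens
    kl = (logprobs - ref_logprobs).sum(-1)
    # Subtracting the penalty makes pure reward-model exploitation costly
    return reward - kl_coef * kl

# Toy usage: the policy puts much higher probability on these tokens than
# the reference does, so the shaped reward drops well below the raw score.
logprobs = torch.tensor([-0.1, -0.2, -0.1])
ref_logprobs = torch.tensor([-2.5, -3.0, -2.8])
print(kl_penalized_reward(torch.tensor(3.0), logprobs, ref_logprobs))
```

If that's what's happening here, raising the penalty coefficient (in trl, I believe the analogous knob is `init_kl_coef` in `PPOConfig`) might help more than extra epochs, though I haven't verified that on this dataset.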
Yes, that sure looks like reward hacking
The problem here may be that we aren’t dealing with toxic data to begin with. There’s only so much AI can do…