The new summaries look wonky. It looks to me like the KL divergence penalty did not do a good job of preventing reward hacking. I’ve highlighted two suspected cases of reward hacking in the first few rows. Similar to this thread: Toxicity mean value increased after detoxification - #5 by gaspardbos. My guess is that it’s due to not enough epochs. Does anyone have any other ideas?
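For context, here is a minimal sketch of how the KL penalty usually enters the PPO reward in this kind of setup (the function name and tensors are illustrative, not TRL's actual internals; `kl_coef` plays the role of `init_kl_coef` in TRL's `PPOConfig`):

```python
import torch

def kl_penalized_rewards(reward, logprobs_policy, logprobs_ref, kl_coef=0.2):
    """Per-token reward used by PPO: the scalar reward from the reward
    model is applied at the last token, and every token pays a penalty
    proportional to the log-ratio between the tuned policy and the
    frozen reference model. If kl_coef is too small, the policy can
    drift far from the reference and reward-hack."""
    # Per-token estimator of KL(policy || reference)
    kl = logprobs_policy - logprobs_ref
    per_token = -kl_coef * kl            # KL penalty on every token
    per_token[:, -1] += reward           # environment reward on the final token
    return per_token

# Toy example: batch of 1, sequence of 4 tokens, scalar reward 1.0
lp_pol = torch.tensor([[-1.0, -0.5, -0.8, -0.3]])
lp_ref = torch.tensor([[-1.2, -0.6, -0.7, -0.9]])
print(kl_penalized_rewards(torch.tensor([1.0]), lp_pol, lp_ref))
```

If the runs were done with TRL, it might be worth raising `init_kl_coef` (or lowering `target` with adaptive KL control enabled) before adding epochs, since a weak penalty is the more direct cause of this failure mode.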
Yes, that sure looks like reward hacking.
The problem here may be that we aren’t dealing with toxic data to begin with. There’s only so much AI can do…
