I think it’s useful to walk through the steps of RLHF in the lab, and I can appreciate that with significantly more compute some qualitative differences would likely emerge, but compared to the previous two labs, this one seemed to have a lot less to show for the training we were able to do in the lab.
Percentage improvement of toxicity score after detoxification:
mean: 19.94%
std: 15.72%
It’s true that the evaluation function showed some percentage improvement in the toxicity score after detoxification, but looking at the qualitative output, most of the summaries are pretty low quality… and they don’t often have much toxic content to start with.
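For reference, my understanding is that the percentage improvement reported above is just the per-sample relative drop in the toxicity score, averaged over the evaluation set. Something along these lines (the helper name and the dummy scores are mine, not the lab’s):

```python
import numpy as np

def percent_improvement(tox_before, tox_after):
    """Per-sample relative drop in toxicity, as a percentage."""
    before = np.asarray(tox_before)
    after = np.asarray(tox_after)
    improvement = (before - after) / before * 100
    return improvement.mean(), improvement.std()

# dummy per-sample toxicity scores, just to make the snippet runnable
mean_imp, std_imp = percent_improvement(
    tox_before=[0.030, 0.050, 0.020],
    tox_after=[0.020, 0.045, 0.018],
)
print(f"mean: {mean_imp:.2f}%, std: {std_imp:.2f}%")
```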
It would have been great to see a before/after completion for a toxic prompt, especially since we have an example of exactly that around executable cells 13 and 14.
I only realized this after I had already closed the lab, but it really wouldn’t take much work to re-evaluate the toxic/non-toxic samples after the RLHF PPO training. Even if the difference turned out to be trivial, it would have been nice to see something qualitative to show for the training we were asked to sit through during the lab.
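Something like this would probably have been enough. It’s only a sketch: it assumes the notebook’s `ref_model`, `ppo_model`, and `tokenizer` are still loaded, and that `toxic_prompt` holds one of the prompts flagged as toxic earlier in the notebook (the variable names are my guesses, not the lab’s):

```python
import evaluate
import torch

# Same toxicity measurement the lab uses for its quantitative evaluation
toxicity = evaluate.load("toxicity", module_type="measurement")

def complete(model, prompt, max_new_tokens=128):
    """Generate a completion for a prompt with a given model."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Compare the reference model and the PPO-trained model on the same toxic prompt
completions = {}
for name, model in [("reference", ref_model), ("ppo", ppo_model)]:
    completions[name] = complete(model, toxic_prompt)
    score = toxicity.compute(predictions=[completions[name]])["toxicity"][0]
    print(f"{name} model toxicity: {score:.4f}\n{completions[name]}\n")
```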
I ended up restarting the lab with a fresh kernel and redoing the fine-tuning after I realized this, but after running through all the cells in one go, I somehow ended up with demonstrably worse training results:
toxicity [mean, std] before detox: [0.024924314866604454, 0.031004952346618646]
toxicity [mean, std] after detox: [0.0421026705510237, 0.04832343680199593]
And sure enough, when I ran a toxic prompt sample (use your imagination) through both the reference model and the ppo_model, the ppo_model, consistent with its increased toxicity score, did seem to include offensive words from the original dialogue more often.
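A crude way to quantify that impression, reusing the `completions` dict from the sketch above (`flagged_words` is a placeholder for the actual offensive terms from the dialogue, which I’m not going to reproduce here):

```python
def count_flagged_words(text, flagged_words):
    """Count how many times any of the flagged words appear in the text."""
    lowered = text.lower()
    return sum(lowered.count(word.lower()) for word in flagged_words)

flagged_words = ["<redacted>", "<redacted>"]  # placeholders for the actual terms
for name, completion in completions.items():
    print(f"{name}: {count_flagged_words(completion, flagged_words)} flagged words")
```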