The last step of lab #3 is a qualitative comparison between the PEFT model and the PPO model. In theory, the PPO model should be less toxic, but looking at the examples, I’m left wondering whether the training was useful at all.
The completion pair with the highest difference in score is:
Original model:
- “Alice’s brought her money and she can’t go to see her mother because she’s ill and Alice has to keep her home.”
Aligned model:
- “Alice can’t go to see Mrs. Brown tomorrow morning because her mother is ill and she doesn’t want to visit her.!”
This pair has a reward delta of ~0.6. But has there been any meaningful learning? It feels as if, despite the KL-divergence penalty, steering the model toward non-toxic completions still induces a form of reward hacking.
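For reference, here is a minimal sketch of how I understand that delta is produced, assuming the lab’s setup of using Meta’s RoBERTa hate-speech classifier (facebook/roberta-hate-speech-dynabench-r4-target) and taking the “nothate” logit as the reward; the function name is mine, not the lab’s.

```python
# Minimal sketch (not the lab's exact code): score both completions with the
# classifier used as the PPO reward model and compare the "nothate" logits.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)

def nothate_logit(text: str) -> float:
    """Return the 'nothate' logit, which (I assume) serves as the PPO reward."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = toxicity_model(**inputs).logits  # shape [1, 2]: one logit per label
    # Look up the index from the config rather than hard-coding it.
    nothate_index = toxicity_model.config.label2id.get("nothate", 0)
    return logits[0, nothate_index].item()

original = ("Alice's brought her money and she can't go to see her mother "
            "because she's ill and Alice has to keep her home.")
aligned = ("Alice can't go to see Mrs. Brown tomorrow morning because her "
           "mother is ill and she doesn't want to visit her.!")

print(f"reward delta: {nothate_logit(aligned) - nothate_logit(original):.2f}")
```

The reward is just a single classifier logit, which is exactly why a ~0.6 gap between two equally muddled sentences doesn’t convince me that anything meaningful was learned.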
For me, this is actually a sign of how careful you have to be when evaluating models. Sure, after applying RLHF as shown in the lab we achieve a mean toxicity decrease of 10%, but that doesn’t seem to have any real-world impact beyond learning to game Meta’s toxicity classifier. I think this should be addressed in the course. Or better still, adapt the lab so the RLHF training has a visible positive effect.
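One way the lab could make such an effect visible (or refute my suspicion) would be to re-score the same completions with a toxicity classifier that was never used as the reward model. A rough sketch, assuming Hugging Face transformers and picking unitary/toxic-bert purely as an illustrative independent scorer (it is not part of the lab):

```python
# Sketch of a cross-check: score PEFT vs. PPO completions with a classifier
# that played no role in training. "unitary/toxic-bert" is an arbitrary
# off-the-shelf choice here, not something the lab uses.
from transformers import pipeline

independent_scorer = pipeline("text-classification", model="unitary/toxic-bert")

def toxicity_score(text: str) -> float:
    """Sigmoid score of the 'toxic' label from the independent classifier."""
    scores = independent_scorer([text], top_k=None, function_to_apply="sigmoid")[0]
    return next(s["score"] for s in scores if s["label"] == "toxic")

# Replace these with the full completion lists from the lab's evaluation step;
# the single pair from above is only here to keep the sketch self-contained.
peft_completions = ["Alice's brought her money and she can't go to see her mother "
                    "because she's ill and Alice has to keep her home."]
ppo_completions = ["Alice can't go to see Mrs. Brown tomorrow morning because her "
                   "mother is ill and she doesn't want to visit her.!"]

mean_peft = sum(map(toxicity_score, peft_completions)) / len(peft_completions)
mean_ppo = sum(map(toxicity_score, ppo_completions)) / len(ppo_completions)
print(f"independent mean toxicity: PEFT {mean_peft:.3f} -> PPO {mean_ppo:.3f}")
```

If the 10% improvement survived an independent scorer, that would answer my doubt; if it didn’t, it would support the reward-hacking reading.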
PS. I tried to add the week-module-3 tag (or similar) to this post, but I can only see “No matches found” when typing it in the tags field.