Reinforcement learning made my lab model MORE toxic

Generative AI with Large Language Models course

Week 3
Lab 3 - Fine-tune FLAN-T5 with reinforcement learning to generate more-positive summaries

The training did not make the model more positive. Here are the results from the final part of section 2, where we measure how toxic the un-tuned model is:

11it [00:23, 2.18s/it]
toxicity [mean, std] before detox: [0.026959281965074213, 0.036199814536090356]
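
For context, section 2 computes these numbers with the Hugging Face evaluate toxicity measurement wrapped around a hate-speech classifier. A minimal sketch of what I believe that cell is doing (variable names like generated_summaries are mine, and the keyword arguments are from memory, so the lab's exact code may differ):

import evaluate
import numpy as np

# Load the hate-speech classifier as a toxicity measurement
# (same model the lab uses, as far as I can tell).
toxicity = evaluate.load(
    "toxicity",
    "facebook/roberta-hate-speech-dynabench-r4-target",
    module_type="measurement",
    toxic_label="hate",
)

# Score each generated summary, then aggregate over the dataset.
scores = toxicity.compute(predictions=generated_summaries)["toxicity"]
print(f"toxicity [mean, std]: [{np.mean(scores)}, {np.std(scores)}]")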

And here are the results from section 3.3, where we measure how toxic the tuned model is and how big the improvement is:

11it [00:19, 1.73s/it]
toxicity [mean, std] after detox: [0.04409130260517651, 0.05958874047019413]

Percentage improvement of toxicity score after detoxification:
mean: -63.55%
std: -64.61%

i.e. the model has become MORE toxic.
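
For what it's worth, the percentages are consistent with the usual relative-improvement formula, (before - after) / before * 100, with the values above plugged in:

mean_before, mean_after = 0.026959281965074213, 0.04409130260517651
std_before, std_after = 0.036199814536090356, 0.05958874047019413

# A negative "improvement" means the score got worse (more toxic).
print(f"mean: {(mean_before - mean_after) / mean_before * 100:.2f}%")  # -63.55%
print(f"std: {(std_before - std_after) / std_before * 100:.2f}%")      # -64.61%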

Have I done something wrong here?

Thanks,

Chris

Hi Chris. We’ve forwarded this concern to our partners but are still waiting for a resolution. A few other learners have reported it as well, while others got the expected output or were able to get a much better outcome after re-running the lab. There might be some random element here that’s affecting the results. We’ll let you know when we get updates. Thanks!
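
If you want to check whether randomness is the culprit before re-running, one option is to fix the seeds at the top of the notebook, for example with transformers' set_seed (a minimal sketch; the lab may not expose a seed itself, and PPO sampling can still vary across hardware):

from transformers import set_seed

# Seeds Python's random, NumPy, and PyTorch (CPU and CUDA) in one call,
# which should make generation sampling and the PPO updates repeatable.
set_seed(42)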


Same issue here:

toxicity [mean, std] before detox: [0.0183, 0.0240]
toxicity [mean, std] after detox: [0.0286, 0.0344]

Percentage improvement of toxicity score after detoxification:
mean: -56.47%
std: -43.10%