Reinforcement learning made my lab model MORE toxic

Generative AI with Large Language Models course

Week 3
Lab 3 - Fine-tune FLAN-T5 with reinforcement learning to generate more-positive summaries

The training did not make the model more positive. Here are the results from the final part of section 2, where we measure how toxic the un-tuned model is:

11it [00:23, 2.18s/it]
toxicity [mean, std] before detox: [0.026959281965074213, 0.036199814536090356]
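
For context, section 2 computes these numbers with the Hugging Face evaluate toxicity measurement wrapped around a hate-speech classifier. A minimal sketch of what I believe that cell is doing (variable names like generated_summaries are mine, and the keyword arguments are from memory, so the lab's exact code may differ):

import evaluate
import numpy as np

# Load the hate-speech classifier as a toxicity measurement
# (same model the lab uses, as far as I can tell).
toxicity = evaluate.load(
    "toxicity",
    "facebook/roberta-hate-speech-dynabench-r4-target",
    module_type="measurement",
    toxic_label="hate",
)

# Score each generated summary, then aggregate over the dataset.
scores = toxicity.compute(predictions=generated_summaries)["toxicity"]
print(f"toxicity [mean, std]: [{np.mean(scores)}, {np.std(scores)}]")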

And here are the results from section 3.3, where we measure how toxic the tuned model is and how big the improvement is:

11it [00:19, 1.73s/it]
toxicity [mean, std] after detox: [0.04409130260517651, 0.05958874047019413]

Percentage improvement of toxicity score after detoxification:
mean: -63.55%
std: -64.61%

i.e. the model has become MORE toxic.
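
For what it's worth, the percentages are consistent with the usual relative-improvement formula, (before - after) / before * 100, with the values above plugged in:

mean_before, mean_after = 0.026959281965074213, 0.04409130260517651
std_before, std_after = 0.036199814536090356, 0.05958874047019413

# A negative "improvement" means the score got worse (more toxic).
print(f"mean: {(mean_before - mean_after) / mean_before * 100:.2f}%")  # -63.55%
print(f"std: {(std_before - std_after) / std_before * 100:.2f}%")      # -64.61%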

Have I done something wrong here?

Thanks,

Chris

Hi Chris. We’ve forwarded this concern to our partners but are still waiting for a resolution. A few other learners have reported it as well, while others got the expected output or were able to get a much better outcome after re-running the lab. There might be some random element here that’s affecting the results. We’ll let you know when we get updates. Thanks!
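
If you want to check whether randomness is the culprit before re-running, one option is to fix the seeds at the top of the notebook, for example with transformers' set_seed (a minimal sketch; the lab may not expose a seed itself, and PPO sampling can still vary across hardware):

from transformers import set_seed

# Seeds Python's random, NumPy, and PyTorch (CPU and CUDA) in one call,
# which should make generation sampling and the PPO updates repeatable.
set_seed(42)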


Same issue here:

toxicity [mean, std] before detox: [0.0183, 0.0240]
toxicity [mean, std] after detox: [0.0286, 0.0344]

Percentage improvement of toxicity score after detoxification:
mean: -56.47%
std: -43.10%