Toxicity mean value increased after detoxification

chong2 · July 7, 2023, 1:46am

Hi,
As you can see in the attached picture, the mean toxicity increased from 0.0378 before detoxification to 0.0430 after.

Does this mean the model actually performs worse after detoxification?
Please kindly clarify. Thanks!

Kasowari · July 7, 2023, 3:12pm

Hi, I am getting the following:

Percentage improvement of toxicity score after detoxification:
mean: 0.38%
std: 12.21%

This is with:

toxicity [mean, std] after detox: [0.02905736598503691, 0.029726363734658947]
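
For anyone comparing numbers, here is a minimal sketch of how a percentage like this can be computed. It assumes improvement is defined as (before - after) / before, so a positive value means toxicity went down and a negative value means it went up. The function name `percent_improvement` is my own, not taken from the lab.

```python
def percent_improvement(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent.

    Positive: the score dropped (less toxic).
    Negative: the score rose (more toxic).
    Assumed definition: (before - after) / before.
    """
    if before == 0:
        raise ValueError("before must be non-zero")
    return (before - after) / before * 100.0


# Mean toxicity values reported in the first post of this thread.
print(f"mean: {percent_improvement(0.0378, 0.0430):.2f}%")
```

With the values from the first post this comes out at roughly -13.76%, i.e. the mean toxicity went up by about 14% relative to the starting value.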

Matous_Eibich · July 11, 2023, 4:20pm

Hello, I’m having similar results:

mean_before_detoxification = 0.03147564082279463
mean_after_detoxification = 0.0357071926118806

My guess is that because the training run is quite short and uses only a subset of the data, the inherent stochasticity of the process can sometimes produce results that contradict expectations. I assume that if you ran it with more data and more epochs, you would get the “correct” result consistently. But this is just a guess.

chong2 · July 12, 2023, 3:44am

Thank you, that makes sense!

gaspardbos · July 20, 2023, 10:04am

Yes, the decrease in mean toxicity is noticeably small in my case, and the standard deviation increased.

toxicity [mean, std] before detox: [0.035475403208031574, 0.03445820341137294]

toxicity [mean, std] after detox: [0.030459516253110698, 0.04322473559713335]

I would also guess that this is due to the short training time and the small number of epochs.
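
One rough way to check whether a change this small is distinguishable from sampling noise is to compare it against the standard error of the difference in means. The sketch below does that with the numbers above. Note the assumptions: the evaluation sample size `n = 10` is a guess (use whatever your run actually evaluated), the before and after scores are treated as two independent samples, and `change_within_noise` is a hypothetical helper name.

```python
import math


def change_within_noise(mean_before, std_before, mean_after, std_after, n, z=1.96):
    """Compare a change in mean against a z * standard-error noise margin.

    Treats the before/after scores as two independent samples of size n
    (an assumption). Returns (difference, margin, within_noise).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    diff = mean_after - mean_before
    # Standard error of the difference between two sample means.
    se = math.sqrt((std_before ** 2 + std_after ** 2) / n)
    margin = z * se
    return diff, margin, abs(diff) < margin


# Mean/std values are the ones reported above; n=10 is ASSUMED.
diff, margin, within = change_within_noise(
    0.035475403208031574, 0.03445820341137294,
    0.030459516253110698, 0.04322473559713335,
    n=10,
)
print(f"change in mean: {diff:+.4f}, noise margin: +/-{margin:.4f}, within noise: {within}")
```

Under these assumptions the noise margin is several times larger than the observed change, so a small move in either direction would not say much about whether detoxification worked.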

IstvanZK · July 24, 2023, 8:26pm

Hi. I got the same type of result, and it is very likely due to the low number of training steps and the rather non-toxic language used as input.

Percentage improvement of toxicity score after detoxification:
mean: -23.40%
std: -33.33%

dkurilo · August 2, 2023, 3:58am

Thank you for the post.
I thought something was wrong with my reasoning.

Well, I’ll try to run it locally with better settings.
Maybe after detoxification I get fewer extremely toxic answers.
Most likely this happens because the training and test samples don’t contain really toxic dialogues, which is where detoxification would be needed. So maybe it just doesn’t make sense to detoxify this model, especially with these samples.
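
A quick way to test the "fewer extremely toxic answers" idea is to look at the upper tail of the scores rather than only the mean. The sketch below is illustrative only: the score lists are made-up numbers, not lab output, and `tail_summary` and the 0.5 threshold are hypothetical choices.

```python
from statistics import mean


def tail_summary(scores, threshold=0.5):
    """Return the max score and the fraction of scores at or above `threshold`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    above = sum(1 for s in scores if s >= threshold)
    return {"max": max(scores), "fraction_above": above / len(scores)}


# HYPOTHETICAL toxicity scores, invented to illustrate the point.
before = [0.01, 0.01, 0.02, 0.02, 0.60]   # mostly clean, one very toxic answer
after = [0.10, 0.12, 0.15, 0.15, 0.18]    # slightly higher overall, no extreme answer

print("mean before/after:", mean(before), mean(after))
print("tail before:", tail_summary(before))
print("tail after: ", tail_summary(after))
```

In this invented example the mean goes up slightly while the extreme answer disappears, so the mean alone would hide the effect. Whether that is what happens in the lab would need to be checked on the real per-sample scores.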

FFloegel · August 6, 2023, 8:48pm

Thank you for the clarification. I thought I had misunderstood, but in the end I got the right understanding.
