Toxicity mean value increased after detoxification

As you can see in the attached picture, the mean toxicity increased from 0.0378 before detoxification to 0.0430 after detoxification.

Does this mean the model actually performs worse after detoxification?
Please kindly clarify. Thanks!

Hi, I am getting the following:

Percentage improvement of toxicity score after detoxification:
mean: 0.38%
std: 12.21%

This is with:

toxicity [mean, std] after detox: [0.02905736598503691, 0.029726363734658947]

Hello, I’m getting similar results:

mean_before_detoxification = 0.03147564082279463
mean_after_detoxification = 0.0357071926118806

My guess is that, because the training process is quite short and we are using only a subset of the data, the inherent stochasticity of the process can sometimes produce results that do not match expectations. I assume that if you ran this with more data and more epochs, you would get “correct” results every time. But this is just a guess. :slight_smile:
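
To put these two means on the same scale as the percentages reported elsewhere in this thread, here is a minimal sketch. It assumes the percentage is the relative change (before - after) / before, so a negative value means toxicity went up. The function name `percent_improvement` is only illustrative and not taken from any library.

```python
def percent_improvement(before, after):
    """Relative decrease from `before` to `after`, in percent.

    ASSUMPTION: improvement = (before - after) / before.
    Positive means less toxic after detoxification, negative means more toxic.
    """
    return (before - after) / before * 100.0


# Mean toxicity values posted above.
mean_before_detoxification = 0.03147564082279463
mean_after_detoxification = 0.0357071926118806

improvement = percent_improvement(mean_before_detoxification,
                                  mean_after_detoxification)
print(f"mean: {improvement:.2f}%")
```

Under that assumed formula, the two means above give roughly -13%, i.e. mean toxicity rose by about 13% relative to the value before detoxification.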


Thank you and it makes sense!

Yes, the decrease in mean toxicity is noticeably small in my case, and the standard deviation increased.

toxicity [mean, std] before detox: [0.035475403208031574, 0.03445820341137294]

toxicity [mean, std] after detox: [0.030459516253110698, 0.04322473559713335]

I would also guess that this has to do with the short training time / small number of epochs.
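
One rough way to check the "it is just noise" guess is to compare the shift in the mean with its standard error. The sketch below uses the [mean, std] values posted above. The sample count `n` is NOT given anywhere in this thread, so the values tried here are placeholders, and the helper name is hypothetical. It also treats the before and after scores as independent samples and assumes a roughly normal sampling distribution, both of which are simplifications.

```python
import math


def mean_shift_in_standard_errors(mean_before, std_before,
                                  mean_after, std_after, n):
    """Difference of the two means divided by its standard error.

    ASSUMPTION: both sets have n samples and are treated as independent.
    """
    standard_error = math.sqrt(std_before ** 2 / n + std_after ** 2 / n)
    return (mean_after - mean_before) / standard_error


# [mean, std] values posted above.
before = (0.035475403208031574, 0.03445820341137294)
after = (0.030459516253110698, 0.04322473559713335)

# ASSUMPTION: n is a placeholder; use the number of samples you evaluated.
shifts = {}
for n in (10, 100, 1000):
    shifts[n] = mean_shift_in_standard_errors(*before, *after, n)
    print(f"n={n}: shift = {shifts[n]:.2f} standard errors")
```

As a rule of thumb, a shift within about two standard errors in either direction is hard to tell apart from sampling noise, so with a small evaluation set a change of this size would not say much about whether the model got better or worse.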

Hi. I got the same type of result, and it is very likely due to the low number of training steps and the input text being fairly non-toxic to begin with.

Percentage improvement of toxicity score after detoxification:
mean: -23.40%
std: -33.33%

Thank you for the post.
I thought something was wrong with my reasoning. :slight_smile:

Well, I’ll try running it locally with better settings.
Maybe after detoxification I have fewer extremely toxic answers.
Most probably it’s because the training and test samples contain few really toxic dialogues, which is where detoxification is actually needed. So maybe it does not make much sense to detoxify this model, especially using these samples.
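
The "fewer extremely toxic answers" idea can be checked by looking at the tail of the per-sample scores instead of only the mean. A minimal sketch follows. The score lists are made up purely for illustration and are not results from this thread, and the 0.5 threshold and the name `tail_summary` are assumptions.

```python
def tail_summary(scores, threshold=0.5):
    """Return (maximum score, fraction of scores above `threshold`)."""
    above = sum(1 for s in scores if s > threshold)
    return max(scores), above / len(scores)


# HYPOTHETICAL per-sample toxicity scores, for illustration only.
scores_before = [0.01, 0.01, 0.01, 0.01, 0.60]
scores_after = [0.15, 0.15, 0.15, 0.15, 0.15]

mean_before = sum(scores_before) / len(scores_before)
mean_after = sum(scores_after) / len(scores_after)

print("mean before/after:", mean_before, mean_after)
print("tail before:", tail_summary(scores_before))
print("tail after:", tail_summary(scores_after))
```

In this made-up case the mean is higher after detoxification, yet the single extreme answer is gone, so the mean alone can hide the effect described above.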

Thank you for the clarification. I thought I had misunderstood, but in the end I got it right.