Lab 3 - toxicity score and performance get worse after detoxification

After running the detoxification, the quantitative toxicity score comparison is as follows:

Percentage improvement of toxicity score after detoxification:
mean: -56.05%
std: -31.30%
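
For reference, here is a minimal sketch of how a percentage improvement like this can be computed, assuming the usual relative-change formula (before - after) / before. The function name and the sample numbers are hypothetical and are not taken from the lab notebook. Under this formula a negative value means the score went up after detoxification, i.e. the output became more toxic on the evaluated samples.

```python
def percentage_improvement(before: float, after: float) -> float:
    """Relative reduction of a score, in percent.

    Positive: the score dropped after detoxification (less toxic).
    Negative: the score rose after detoxification (more toxic).
    """
    if before == 0:
        raise ValueError("baseline score must be non-zero")
    return (before - after) / before * 100.0


# Hypothetical toxicity means, for illustration only.
mean_before = 0.020
mean_after = 0.030
print(f"mean: {percentage_improvement(mean_before, mean_after):.2f}%")
```

Because the baseline is in the denominator, a small baseline score makes the percentage swing widely even for small absolute changes.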

Is it normal for the model to perform worse after detoxification? What are the possible reasons for worsening performance?

Also, in the qualitative comparison, the performance is worse too. Do we need another round of full fine-tuning/PEFT after detoxification?

Or did I do anything wrong?

Thanks in advance!

I think the detoxification process needs to run on the entire dataset to get better results.

Makes sense, thanks! :+1:

