Lab 3 - decrease in toxicity score and performance after detoxification

After running the detoxification, the quantitative toxicity score comparison is as follows:

Percentage improvement of toxicity score after detoxification:
mean: -56.05%
std: -31.30%
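
For anyone reading these numbers: a minimal sketch of how a figure like this could be computed, assuming the improvement is defined as (before - after) / before. That definition, the function name `percentage_improvement`, and the sample scores below are assumptions for illustration, not values from the notebook. Under this definition a negative percentage means the toxicity score went up after detoxification.

```python
import numpy as np

def percentage_improvement(before, after):
    """Relative reduction from `before` to `after`, in percent.

    Positive: the score dropped (less toxic). Negative: the score rose.
    """
    return (before - after) / before * 100.0

# Hypothetical per-sample toxicity scores, NOT taken from the lab output.
toxicity_before = np.array([0.02, 0.04, 0.03, 0.05])
toxicity_after = np.array([0.03, 0.06, 0.05, 0.08])

# Compare the mean and the spread of the scores before and after.
mean_improvement = percentage_improvement(toxicity_before.mean(), toxicity_after.mean())
std_improvement = percentage_improvement(toxicity_before.std(), toxicity_after.std())

print(f"mean: {mean_improvement:.2f}%")
print(f"std: {std_improvement:.2f}%")
```

With a small evaluation sample, the mean can swing a lot from one or two generations, so the sign of the result is worth rechecking on more samples.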

Is it normal for the model to perform worse after detoxification? What are the possible reasons for worsening performance?

Also, from the qualitative comparison, the performance has worsened too. Do we need another round of full fine-tuning/PEFT after detoxification?

Or did I do anything wrong?

Thanks in advance!

I think the detoxification process should run on the entire dataset to have better results.

Makes sense, thanks! :+1:

Hi, from the Lab 3 walkthrough it follows that the notebook is supposed to decrease the mean toxicity score. It doesn’t. The goal of Lab 3 is to decrease toxicity. The steps in the notebook are there for students to learn how to use RL/PPO & PEFT to decrease toxicity. Are the steps wrong?

It should decrease toxicity. I cannot say the steps are wrong; experts in the field came up with them. Maybe you are not fully training the model, or maybe something has been changed from the original notebook.