Lab 3, 3.3: Toxicity is worse after fine-tuning according to the metrics

In "3.3 - Evaluate the Model Quantitatively", when evaluating the new fine-tuned model against the reference model, the evaluation metrics have deteriorated, as shown below:
Code:

mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print(f'Toxicity mean before: {mean_before_detoxification}')
print(f'Toxicity mean after: {mean_after_detoxification}')
print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Output:
Toxicity mean before: 0.026611298873004587
Toxicity mean after: 0.02894868269901384
Percentage improvement of toxicity score after detoxification:
mean: -8.78%
std: -20.21%
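
For context, the before/after numbers above come from scoring batches of generated summaries with a toxicity classifier. A minimal sketch of that step, assuming the Hugging Face `evaluate` toxicity measurement and two hypothetical lists of completions (`completions_before`, `completions_after`), would look like this:

```python
import numpy as np
import evaluate

# Load the toxicity measurement; by default it wraps a RoBERTa-based hate-speech classifier
toxicity = evaluate.load("toxicity", module_type="measurement")

def toxicity_stats(completions):
    """Return mean and std of the per-completion toxicity scores."""
    scores = toxicity.compute(predictions=completions)["toxicity"]
    return np.mean(scores), np.std(scores)

# completions_before / completions_after are hypothetical lists of generated summaries
mean_before_detoxification, std_before_detoxification = toxicity_stats(completions_before)
mean_after_detoxification, std_after_detoxification = toxicity_stats(completions_after)
```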

Hi Marko, and welcome to the community! Thank you for the feedback. We’ll review this and update the notebook if necessary.

I got the same worse results after fine-tuning.

I encountered the same issue and also noticed that the toxicity scores change across runs. This is because do_sample is set to True in GenerationConfig(), which introduces some variation in the LLM responses.
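
If you want the evaluation to be repeatable, one option is to switch to greedy decoding or fix the seed. A minimal sketch, assuming the transformers GenerationConfig API the notebook already uses and an assumed max_new_tokens value:

```python
from transformers import GenerationConfig, set_seed

# Option 1: greedy decoding - repeated runs produce identical completions
greedy_config = GenerationConfig(
    max_new_tokens=200,  # assumed value; use whatever the notebook sets
    do_sample=False,
)

# Option 2: keep sampling (as in the lab) but fix the seed so runs are at least repeatable
set_seed(42)
sampling_config = GenerationConfig(
    max_new_tokens=200,
    top_k=0,
    top_p=1.0,
    do_sample=True,
)
```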

When I set do_sample to False to get consistent LLM output, I still got a slightly negative improvement, and examining the actual LLM outputs from the baseline and the RL-tuned model, there was not much difference. I think it's a combination of this lab being a very constrained toy example (not many training iterations, and the dataset itself doesn't contain toxic content).

Curious to hear the course organizers' feedback on this as well.

I got the same issue in section "3.3 - Evaluate the Model Quantitatively". I also checked the results in the following section, "3.4 - Evaluate the Model Qualitatively", but didn't find any meaningful improvement there either.

Maybe it's because the text wasn't toxic in the first place. Add to that the randomness in the answers, and you can see slight variations either way.

I got a slight improvement, but looking at the completions before and after, I couldn’t detect any difference to speak of. In the small sample I looked at, I felt the optimised model did a worse job at summarising the texts, though.

Hi, same here.

Hi. I also noticed the same issue.

Same issue here. I got:

Percentage improvement of toxicity score after detoxification:
mean: -11.94%
std: 9.89%

In the final block of code, where the answers generated by the reference model and the fine-tuned model are shown in a dataframe, I couldn't notice any real improvement by reading the text. Besides, the reward scores seem to get better or worse at random across the sentences.
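
One way to make that qualitative comparison easier to read is to sort the dataframe by the per-example change in reward. A minimal sketch, assuming hypothetical lists of queries, responses, and reward scores (not the notebook's exact variable names):

```python
import pandas as pd

# Hypothetical inputs: queries plus completions and reward scores before/after PPO fine-tuning
df_compare = pd.DataFrame({
    "query": queries,
    "response_before": responses_before,
    "response_after": responses_after,
    "reward_before": rewards_before,
    "reward_after": rewards_after,
})

# Per-example change in reward; sorting by it shows whether gains are systematic or just noise
df_compare["reward_diff"] = df_compare["reward_after"] - df_compare["reward_before"]
df_compare = df_compare.sort_values("reward_diff", ascending=False)
print(df_compare.head(10))
```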