Lab 3, 3.3: Toxicity is worse after fine-tuning according to metrics

In ‘3.3 - Evaluate the Model Quantitatively’, when evaluating the new fine-tuned model against the reference model, the evaluation metrics deteriorate, as shown below:
Code:

mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print(f'Toxicity mean before: {mean_before_detoxification}')
print(f'Toxicity mean after: {mean_after_detoxification}')
print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Output:
Toxicity mean before: 0.026611298873004587
Toxicity mean after: 0.02894868269901384
Percentage improvement of toxicity score after detoxification:
mean: -8.78%
std: -20.21%
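
For context, the comparison above is just the relative change in the mean and standard deviation of the toxicity scores, so a negative percentage means the score went up after PPO fine-tuning. Below is a minimal, self-contained sketch of that calculation, assuming the Hugging Face evaluate library's default toxicity measurement; it is not the notebook's code, and the two completion lists are placeholders for the reference-model and PPO-model outputs the lab actually scores.

# Minimal sketch, not the lab's exact code: score a few completions before and
# after detoxification, then compute the relative improvement as in the cell above.
import numpy as np
import evaluate

# Hugging Face "toxicity" measurement (defaults to a RoBERTa hate-speech classifier).
toxicity = evaluate.load("toxicity", module_type="measurement")

# Placeholder completions; in the lab these come from the reference and PPO-tuned models.
completions_before = ["a completion from the reference model", "another reference completion"]
completions_after = ["a completion from the fine-tuned model", "another fine-tuned completion"]

scores_before = toxicity.compute(predictions=completions_before)["toxicity"]
scores_after = toxicity.compute(predictions=completions_after)["toxicity"]

mean_before, std_before = np.mean(scores_before), np.std(scores_before)
mean_after, std_after = np.mean(scores_after), np.std(scores_after)

# Positive = toxicity dropped; negative (as in the output above) = toxicity rose.
mean_improvement = (mean_before - mean_after) / mean_before
std_improvement = (std_before - std_after) / std_before

print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')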


Hi Marko, and welcome to the community! Thank you for the feedback. We’ll review this and update the notebook if necessary.


I got the same worse results after fine-tuning.


I encountered the same issue and also noticed that the toxicity scores change across runs. This is because do_sample is set to True in GenerationConfig(), which causes some variation in the LLM responses.
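
For anyone who wants to separate sampling noise from a real regression, here is a minimal sketch (not the notebook's code) of two ways to pin the generation down with transformers; the parameter values below are placeholders, not the lab's settings.

# Minimal sketch, not the lab's code: make generation reproducible so the
# before/after toxicity scores are comparable across runs.
from transformers import GenerationConfig, set_seed

# Option 1: turn sampling off entirely (greedy decoding).
deterministic_config = GenerationConfig(
    max_new_tokens=100,   # placeholder value
    do_sample=False,      # same prompt always yields the same completion
)

# Option 2: keep sampling but fix the seed before each evaluation pass, so the
# reference model and the PPO-tuned model are sampled under the same randomness.
set_seed(42)
sampled_config = GenerationConfig(
    max_new_tokens=100,   # placeholder value
    do_sample=True,
    top_k=0,
    top_p=1.0,
)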

When I set do_sample to False to get consistent LLM output, I still got a slightly negative improvement, and when I examined the actual LLM outputs from the baseline and the RL-tuned model, there was not much difference between them. I think it's a combination of this lab being a very constrained toy example (not many training iterations, and the dataset itself doesn't contain toxic content).

Curious to hear the course organizers' feedback on this as well.


I got the same issue in section “3.3 - Evaluate the Model Quantitatively”. I checked the results in the last section, “3.4 - Evaluate the Model Qualitatively”, but didn't find any meaningful improvement there either.