Lab 3, 3.3: Toxicity is worse after fine-tuning according to metrics

In "3.3 - Evaluate the Model Quantitatively", when evaluating the new fine-tuned model against the reference model, the evaluation metrics deteriorate, as shown below:
Code:

```python
mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print(f'Toxicity mean before: {mean_before_detoxification}')
print(f'Toxicity mean after: {mean_after_detoxification}')
print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')
```

Output:

```
Toxicity mean before: 0.026611298873004587
Toxicity mean after: 0.02894868269901384
Percentage improvement of toxicity score after detoxification:
mean: -8.78%
std: -20.21%
```
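
For reference, here is a minimal sketch of where these statistics come from, assuming the lab's toxicity measurement from the Hugging Face evaluate library; the model name and the completions_* lists below are placeholders, not the lab's exact values:

```python
import numpy as np
import evaluate

# Toxicity measurement backed by a hate-speech classifier; the model name
# here is an assumption based on the lab setup.
toxicity = evaluate.load(
    "toxicity",
    "facebook/roberta-hate-speech-dynabench-r4-target",
    module_type="measurement",
)

def toxicity_stats(completions):
    # compute() returns one toxicity probability per input text
    scores = toxicity.compute(predictions=completions)["toxicity"]
    return np.mean(scores), np.std(scores)

# Placeholder completions; in the lab these come from generating with the
# reference model and the PPO-tuned model on the same prompts.
completions_before = ["Person1 wants to upgrade the system.", "They argue about the movie."]
completions_after = ["Person1 asks about upgrading the system.", "They discuss the movie."]

mean_before_detoxification, std_before_detoxification = toxicity_stats(completions_before)
mean_after_detoxification, std_after_detoxification = toxicity_stats(completions_after)
```

Note that since the "before" mean is already tiny (about 0.027), even a small absolute increase shows up as a large negative percentage.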


Hi Marko, and welcome to the community! Thank you for the feedback. We’ll review this and update the notebook if necessary.


I get the same worse results after fine-tuning.


I encountered the same issue and also noticed that the toxicity scores change across runs. This is because do_sample is set to True in GenerationConfig(), which causes some variation in the LLM responses.
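
For anyone who wants repeatable numbers, a minimal sketch of deterministic generation with transformers; the config values are illustrative, not the lab's exact settings:

```python
from transformers import GenerationConfig, set_seed

set_seed(42)  # fixes the Python/NumPy/PyTorch RNG state for repeatable runs

# Greedy decoding: do_sample=False removes sampling randomness, so the
# same prompt produces the same completion on every run.
generation_config = GenerationConfig(
    max_new_tokens=200,  # illustrative value
    do_sample=False,
)
```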

When I set do_sample to False to get consistent LLM output, I still got a slightly negative improvement, and examining the actual LLM outputs between the baseline and after RL tuning, there was not much difference. I think it's a combination of this lab being a very constrained toy example (not many training iterations, and the dataset itself doesn't contain much toxic content).

Curious to hear the course organizers' feedback on this as well.


I got the same issue in section "3.3 - Evaluate the Model Quantitatively". I checked the results in the final section, "3.4 - Evaluate the Model Qualitatively", but didn't find any meaningful improvement either.


Maybe it's because the text wasn't particularly toxic in the first place. Add to that the randomness in the answers, and you can see slight variations either way.

I got a slight improvement, but looking at the completions before and after, I couldn’t detect any difference to speak of. In the small sample I looked at, I felt the optimised model did a worse job at summarising the texts, though.


Hi, same here.


Hi. I also noticed the same issue.

Same issue here; I got:

```
Percentage improvement of toxicity score after detoxification:
mean: -11.94%
std: 9.89%
```

In the final block of code, where we display the answers generated by the reference model and the fine-tuned model in a dataframe, I couldn't notice any real improvement by reading the text. Besides, the scored rewards seem to get better or worse at random across the sentences.
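
One way to check whether those reward swings are noise is to compare the paired per-prompt rewards directly; a minimal sketch, assuming two equal-length reward lists pulled from that dataframe (the values are placeholders):

```python
import numpy as np

# Placeholder rewards; in the lab these would be the "not hate" reward
# scores for the reference and the PPO-tuned model on the same prompts.
rewards_before = np.array([2.1, 1.8, 2.5, 1.9, 2.2])
rewards_after = np.array([2.3, 1.7, 2.6, 2.0, 2.1])

diffs = rewards_after - rewards_before
print(f"mean paired difference: {diffs.mean():+.3f}")
print(f"fraction of prompts that improved: {(diffs > 0).mean():.0%}")
```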

Hello Chris,

Are there any updates regarding the review and potential update of the notebook?

It seems that the same issue is recurring for multiple learners (e.g., Week3-Lab3-Detoxification - #2 by Anna_Kay).

Thank you!

Hi Anna. Sorry this was deprioritized a while back. We will look into it this week. I’ll update this thread by Friday. Thanks!


Hi. Sorry, this will be delayed until Tuesday next week. I will update this thread again by then. Thanks!