Lab 3, 3.3: Toxicity is worse after fine-tuning according to the metrics

In "3.3 - Evaluate the Model Quantitatively", when evaluating the new fine-tuned model against the reference model, the evaluation metrics have deteriorated, as shown below:
Code:

mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print(f'Toxicity mean before: {mean_before_detoxification}')
print(f'Toxicity mean after: {mean_after_detoxification}')
print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Output:
Toxicity mean before: 0.026611298873004587
Toxicity mean after: 0.02894868269901384
Percentage improvement of toxicity score after detoxification:
mean: -8.78%
std: -20.21%
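
For context, the before/after numbers above come from scoring batches of generated summaries with a toxicity classifier. A minimal sketch of that step, assuming the Hugging Face `evaluate` toxicity measurement and two hypothetical lists of completions (`completions_before`, `completions_after`), would look like this:

```python
import numpy as np
import evaluate

# Load the toxicity measurement; by default it wraps a RoBERTa-based hate-speech classifier
toxicity = evaluate.load("toxicity", module_type="measurement")

def toxicity_stats(completions):
    """Return mean and std of the per-completion toxicity scores."""
    scores = toxicity.compute(predictions=completions)["toxicity"]
    return np.mean(scores), np.std(scores)

# completions_before / completions_after are hypothetical lists of generated summaries
mean_before_detoxification, std_before_detoxification = toxicity_stats(completions_before)
mean_after_detoxification, std_after_detoxification = toxicity_stats(completions_after)
```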

Hi Marko, and welcome to the community! Thank you for the feedback. We’ll review this and update the notebook if necessary.

I got the same worse results after fine-tuning.

I encountered the same issue and also noticed that the toxicity scores change across runs. This is because do_sample is set to True in GenerationConfig(), which introduces some variation in the LLM responses.
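
If you want the evaluation to be repeatable, one option is to switch to greedy decoding or fix the seed. A minimal sketch, assuming the transformers GenerationConfig API the notebook already uses and an assumed max_new_tokens value:

```python
from transformers import GenerationConfig, set_seed

# Option 1: greedy decoding - repeated runs produce identical completions
greedy_config = GenerationConfig(
    max_new_tokens=200,  # assumed value; use whatever the notebook sets
    do_sample=False,
)

# Option 2: keep sampling (as in the lab) but fix the seed so runs are at least repeatable
set_seed(42)
sampling_config = GenerationConfig(
    max_new_tokens=200,
    top_k=0,
    top_p=1.0,
    do_sample=True,
)
```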

When I set do_sample to False to get consistent LLM output, I still got a slightly negative improvement, and examining the actual LLM outputs from the baseline and the RL-tuned model, there was not much difference. I think it's a combination of this lab being a very constrained toy example (not many training iterations, and the dataset itself doesn't contain toxic content).

Curious to hear the course organizers' feedback on this as well.

I got the same issue in section "3.3 - Evaluate the Model Quantitatively". I also checked the results in the following section, "3.4 - Evaluate the Model Qualitatively", but didn't find any meaningful improvement there either.

Maybe it's because the text wasn't toxic in the first place. Add to that the randomness in the answers, and you can see slight variations either way.

I got a slight improvement, but looking at the completions before and after, I couldn’t detect any difference to speak of. In the small sample I looked at, I felt the optimised model did a worse job at summarising the texts, though.

Hi, same here.

Hi. I also noticed the same issue.

Same issue here. I got:

Percentage improvement of toxicity score after detoxification:
mean: -11.94%
std: 9.89%

In the final block of code, where the answers generated by the reference model and the fine-tuned model are shown in a dataframe, I couldn't notice any real improvement by reading the text. Besides, the reward scores seem to get better or worse at random across the sentences.
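
One way to make that qualitative comparison easier to read is to sort the dataframe by the per-example change in reward. A minimal sketch, assuming hypothetical lists of queries, responses, and reward scores (not the notebook's exact variable names):

```python
import pandas as pd

# Hypothetical inputs: queries plus completions and reward scores before/after PPO fine-tuning
df_compare = pd.DataFrame({
    "query": queries,
    "response_before": responses_before,
    "response_after": responses_after,
    "reward_before": rewards_before,
    "reward_after": rewards_after,
})

# Per-example change in reward; sorting by it shows whether gains are systematic or just noise
df_compare["reward_diff"] = df_compare["reward_after"] - df_compare["reward_before"]
df_compare = df_compare.sort_values("reward_diff", ascending=False)
print(df_compare.head(10))
```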