In '3.3 - Evaluate the Model Quantitatively', when evaluating the new fine-tuned model against the reference model, the evaluation metrics actually deteriorate, as shown in the output below. Code part:
"mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification
Output
print(f’Toxicity mean before: {mean_before_detoxification}‘)
print(f’Toxicity mean after: {mean_after_detoxification}’)
print(f’Percentage improvement of toxicity score after detoxification:‘)
print(f’mean: {mean_improvement*100:.2f}')
print(f'std: {std_improvement*100:.2f}’)"
Toxicity mean before: 0.026611298873004587
Toxicity mean after: 0.02894868269901384
Percentage improvement of toxicity score after detoxification:
mean: -8.78%
std: -20.21%
I encountered the same issue and also noticed that the toxicity scores change across runs. This is because do_sample is set to True in GenerationConfig(), which introduces some variation in the LLM responses.
When I set do_sample to False to get consistent LLM output, I still got a slightly negative improvement, and examining the actual LLM outputs between the baseline and the RL-tuned model, there was not much difference. I think it's a combination of factors: this lab is a very constrained toy example (not many training iterations), and the dataset itself doesn't contain much toxic content.
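For what it's worth, here is a minimal sketch of two ways to remove that sampling randomness when evaluating (the values are illustrative, not the lab's exact settings):

from transformers import GenerationConfig, set_seed

# Option 1: greedy decoding. With do_sample=False the model always picks the
# highest-probability token, so repeated evaluation runs score identical completions.
generation_config = GenerationConfig(
    max_new_tokens=200,  # illustrative value
    do_sample=False,
)

# Option 2: keep do_sample=True but fix the seed, so the sampled completions
# (and therefore the toxicity scores) are reproducible across runs.
set_seed(42)

Either way, the before/after comparison at least stops drifting between runs, even if the underlying change is still small.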
Curious to hear the course organizers' feedback on this as well.
I got the same issue in section "3.3 - Evaluate the Model Quantitatively". I also checked the results in the last section, "3.4 - Evaluate the Model Qualitatively", but didn't find any meaningful improvement there either.
Maybe that's because the text wasn't very toxic in the first place. Add to that the randomness in the answers, and you can see slight variations either way.
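One way to check how much of that is just sampling noise is to repeat the quantitative evaluation with a few different seeds and look at the spread of the means. A rough sketch, assuming the lab's evaluate_toxicity helper (mean and std over a sample of completions) and its usual arguments are in scope; argument names may differ slightly in your notebook:

from transformers import set_seed
import numpy as np

def mean_toxicity_over_runs(model, seeds=(0, 1, 2, 3, 4)):
    # Re-run the lab's evaluation with several seeds and report the spread of the means.
    means = []
    for seed in seeds:
        set_seed(seed)  # different seed -> different sampled completions
        mean, std = evaluate_toxicity(
            model=model,
            toxicity_evaluator=toxicity_evaluator,
            tokenizer=tokenizer,
            dataset=dataset["test"],
            num_samples=10,
        )
        means.append(mean)
    return np.mean(means), np.std(means)

If the run-to-run spread of the mean is on the same order as the -8% to -12% "improvement" reported above, the difference is essentially noise.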
I got a slight improvement, but looking at the completions before and after, I couldn’t detect any difference to speak of. In the small sample I looked at, I felt the optimised model did a worse job at summarising the texts, though.
Percentage improvement of toxicity score after detoxification:
mean: -11.94%
std: 9.89%
In the final block of code, where we display answers generated by the reference model and the fine-tuned model from a dataframe, I couldn't notice any real improvement by reading the text. Besides, the reward scores seem to get better or worse at random across the sentences.
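Since the rewards move in both directions, it can help to compare them per sample rather than eyeballing the dataframe. A small sketch, assuming the comparison dataframe from the lab is called df_compare_results and has reward_before / reward_after columns (adjust the names to whatever your notebook actually produces):

from scipy import stats

# Per-sentence change in reward (after - before); positive means the
# PPO-tuned model was scored as less toxic on that sentence.
diff = (
    df_compare_results["reward_after"].to_numpy()
    - df_compare_results["reward_before"].to_numpy()
)

print(f"mean reward change: {diff.mean():.4f}")
print(f"fraction of sentences that improved: {(diff > 0).mean():.2%}")

# Paired t-test on the per-sentence differences; a large p-value is
# consistent with the 'better/worse at random' impression.
t_stat, p_value = stats.ttest_rel(
    df_compare_results["reward_after"], df_compare_results["reward_before"]
)
print(f"paired t-test p-value: {p_value:.3f}")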