I think it’s useful to walk through the steps of RLHF in the lab, and I can appreciate that with significantly more compute some qualitative differences would likely emerge, but compared to the previous two labs, this one seemed to have a lot less to show for the training we were able to do in the lab.
Percentage improvement of toxicity score after detoxification:
mean: 19.94%
std: 15.72%
It’s true that the evaluation function showed some percentage improvement in the toxicity score after detoxification, but looking at the qualitative output, most of the summaries are pretty low quality… and they don’t often have much toxic content to start with.
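For reference, my understanding is that the percentage improvement reported above is just the per-sample relative drop in the toxicity score, averaged over the evaluation set. Something along these lines (the helper name and the dummy scores are mine, not the lab’s):

```python
import numpy as np

def percent_improvement(tox_before, tox_after):
    """Per-sample relative drop in toxicity, as a percentage."""
    before = np.asarray(tox_before)
    after = np.asarray(tox_after)
    improvement = (before - after) / before * 100
    return improvement.mean(), improvement.std()

# dummy per-sample toxicity scores, just to make the snippet runnable
mean_imp, std_imp = percent_improvement(
    tox_before=[0.030, 0.050, 0.020],
    tox_after=[0.020, 0.045, 0.018],
)
print(f"mean: {mean_imp:.2f}%, std: {std_imp:.2f}%")
```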
It would have been great to see a before/after completion for a toxic prompt, especially since we have an example of exactly that around executable cells 13 and 14.
I only realized this after I had already closed the lab, but it really wouldn’t take much work to re-evaluate the toxic/non-toxic samples after the RLHF PPO training. Even if the difference turned out to be trivial, it would have been nice to see something qualitative to show for the training we were asked to sit through during the lab.
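Something like this would probably have been enough. It’s only a sketch: it assumes the notebook’s `ref_model`, `ppo_model`, and `tokenizer` are still loaded, and that `toxic_prompt` holds one of the prompts flagged as toxic earlier in the notebook (the variable names are my guesses, not the lab’s):

```python
import evaluate
import torch

# Same toxicity measurement the lab uses for its quantitative evaluation
toxicity = evaluate.load("toxicity", module_type="measurement")

def complete(model, prompt, max_new_tokens=128):
    """Generate a completion for a prompt with a given model."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Compare the reference model and the PPO-trained model on the same toxic prompt
completions = {}
for name, model in [("reference", ref_model), ("ppo", ppo_model)]:
    completions[name] = complete(model, toxic_prompt)
    score = toxicity.compute(predictions=[completions[name]])["toxicity"][0]
    print(f"{name} model toxicity: {score:.4f}\n{completions[name]}\n")
```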
I ended up restarting the lab with a fresh kernel and redoing the fine-tuning after I realized this, but after running through all the cells in one go, I somehow ended up with demonstrably worse training results:
toxicity [mean, std] before detox: [0.024924314866604454, 0.031004952346618646]
toxicity [mean, std] after detox: [0.0421026705510237, 0.04832343680199593]
And sure enough, when I ran a toxic prompt sample (use your imagination) through both the reference model and the ppo_model, the ppo_model, consistent with its increased toxicity score, did seem to include offensive words from the original dialogue more often.
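A crude way to quantify that impression, reusing the `completions` dict from the sketch above (`flagged_words` is a placeholder for the actual offensive terms from the dialogue, which I’m not going to reproduce here):

```python
def count_flagged_words(text, flagged_words):
    """Count how many times any of the flagged words appear in the text."""
    lowered = text.lower()
    return sum(lowered.count(word.lower()) for word in flagged_words)

flagged_words = ["<redacted>", "<redacted>"]  # placeholders for the actual terms
for name, completion in completions.items():
    print(f"{name}: {count_flagged_words(completion, flagged_words)} flagged words")
```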