Mistake in Lab 3 training loop

To get the reward (the logit of the “nothate” class), a fixed not_hate_index is used. For instance, in the training loop:

reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]

However, the index of the “nothate” class in the sentiment_pipe output is not fixed; the entries are ordered by predicted score. If the text is classified as non-toxic, “nothate” is at index 0, but if the predicted class is toxic, “nothate” is at index 1.

Reward model output:
For non-toxic text:
[{'label': 'nothate', 'score': 3.114100694656372}, {'label': 'hate', 'score': -2.4896175861358643}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.003670616541057825}]
For toxic text:
[{'label': 'hate', 'score': 2.815366268157959}, {'label': 'nothate', 'score': -3.1568620204925537}]
[{'label': 'hate', 'score': 0.9974579215049744}, {'label': 'nothate', 'score': 0.0025420780293643475}]

Thus, when the output of the LLM is already toxic, the PPO algorithm will reward and train the model to be even more toxic.
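
For reference, the ordering flip is easy to reproduce. A minimal sketch (the two example strings are hypothetical; sentiment_pipe is the reward-model pipeline from the lab, called so that it returns raw scores for all labels, and the exact call parameters may differ by transformers version):

# The pipeline sorts its output by score, so "nothate" is not always at index 0.
non_toxic_text = "Thank you, that was really helpful."   # hypothetical example
toxic_text = "You are awful and nobody likes you."       # hypothetical example
print(sentiment_pipe(non_toxic_text, top_k=None, function_to_apply="none"))
# [{'label': 'nothate', 'score': ...}, {'label': 'hate', 'score': ...}]
print(sentiment_pipe(toxic_text, top_k=None, function_to_apply="none"))
# [{'label': 'hate', 'score': ...}, {'label': 'nothate', 'score': ...}]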


Hi Christian, and welcome to the community! Thank you for the feedback. We’ll investigate and update the notebook if needed.


Hey @chris.favila,
This issue still exists and hasn’t been corrected yet. Although it probably didn’t cause many problems in this particular fine-tuning run, I believe it’s important to fix, in case someone tries to replicate this code for another dataset.

For this dataset, since almost all the samples are non-toxic, the “nothate” label appears at index 0 in nearly every case, which is why the original code still works here.

Here is the corrected code for your reference:

# Fixed Code
reward_tensors = []
for reward in rewards:
    # The pipeline orders labels by predicted score, not by a fixed index,
    # so look up the position of the "nothate" entry for each sample.
    for ind, rew in enumerate(reward):
        if rew['label'] == 'nothate':
            reward_not_hate_index = ind
    reward_tensors.append(torch.tensor(reward[reward_not_hate_index]["score"]))

# Erroneous Code
# You use the `nothate` item because this is the score for the positive `nothate` class.
# reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]
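
If it helps, the same label-based lookup can also be written as a comprehension. This is just a sketch; it assumes each reward is a list of {'label': ..., 'score': ...} dicts, as in the pipeline output shown earlier:

# Equivalent fix: pick the "nothate" entry by label instead of by position.
reward_tensors = [
    torch.tensor(next(item["score"] for item in reward if item["label"] == "nothate"))
    for reward in rewards
]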

Cheers,
Elemento


Hi Elemento! It’s been a while and glad to see you here again. Thank you for resurfacing this issue. We escalated this some time ago but haven’t heard back. There have been some changes, and we might be able to push fixes faster now. We’ll prioritize this issue this coming work week. I realize this has come up in a few other threads. Thanks!


Hey Chris,
Likewise, I can see a lot of changes in the community. Looking forward to engaging with some new and exciting courses.

Cheers,
Elemento


Hello. Just an update. This is under review. Will try to update by end of the week.


Update: This is still not revised because there seems to be another problem. The “query” is also passed to the toxicity evaluator (i.e., as query-response pairs). That can obscure the score, because the original text can pull it down even if the summary itself is not toxic. We’re looking into this. Thanks.
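
For example, one possible fix would be to score only the generated response rather than the concatenated query-response string. A rough sketch (the variable names are hypothetical and depend on how the loop is structured):

# Score only the decoded summaries, not the query + response concatenation,
# so the toxicity of the original dialogue cannot drag down the reward.
texts_to_score = batch["response"]   # hypothetical: list of generated summaries
rewards = sentiment_pipe(texts_to_score, top_k=None, function_to_apply="none")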
