Mistake in Lab 3 training loop

To get the reward (the logit of the “nothate” class), a fixed not_hate_index is used. For instance, in the training loop:

reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]

However, the index of the “nothate” class in the sentiment_pipe output is not fixed; the entries are ordered by predicted score. If the text is classified as non-toxic, “nothate” is at index 0, but if the predicted class is toxic, “nothate” is at index 1.

Reward model output:
For non-toxic text:
[{'label': 'nothate', 'score': 3.114100694656372}, {'label': 'hate', 'score': -2.4896175861358643}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.003670616541057825}]
For toxic text:
[{'label': 'hate', 'score': 2.815366268157959}, {'label': 'nothate', 'score': -3.1568620204925537}]
[{'label': 'hate', 'score': 0.9974579215049744}, {'label': 'nothate', 'score': 0.0025420780293643475}]

Thus, when the output of the LLM is already toxic, the PPO algorithm will reward and train the model to be even more toxic.
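
For reference, the ordering flip is easy to reproduce. A minimal sketch (the two example strings are hypothetical; sentiment_pipe is the reward-model pipeline from the lab, called so that it returns raw scores for all labels, and the exact call parameters may differ by transformers version):

# The pipeline sorts its output by score, so "nothate" is not always at index 0.
non_toxic_text = "Thank you, that was really helpful."   # hypothetical example
toxic_text = "You are awful and nobody likes you."       # hypothetical example
print(sentiment_pipe(non_toxic_text, top_k=None, function_to_apply="none"))
# [{'label': 'nothate', 'score': ...}, {'label': 'hate', 'score': ...}]
print(sentiment_pipe(toxic_text, top_k=None, function_to_apply="none"))
# [{'label': 'hate', 'score': ...}, {'label': 'nothate', 'score': ...}]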


Hi Christian, and welcome to the community! Thank you for the feedback. We’ll investigate and update the notebook if needed.


Hey @chris.favila,
This issue still exists and hasn’t been corrected yet. Although it probably didn’t cause many problems in this particular fine-tuning run, I believe it’s important to fix, in case someone tries to replicate this code for another dataset.

For this dataset, since almost all the samples are non-toxic, the “nothate” label appears at index 0 in nearly every case, which is why the original code still works here.

Here is the corrected code for your reference:

# Fixed Code
reward_tensors = []
for reward in rewards:
    # The pipeline orders labels by predicted score, not by a fixed index,
    # so look up the position of the "nothate" entry for each sample.
    for ind, rew in enumerate(reward):
        if rew['label'] == 'nothate':
            reward_not_hate_index = ind
    reward_tensors.append(torch.tensor(reward[reward_not_hate_index]["score"]))

# Erroneous Code
# You use the `nothate` item because this is the score for the positive `nothate` class.
# reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]
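
If it helps, the same label-based lookup can also be written as a comprehension. This is just a sketch; it assumes each reward is a list of {'label': ..., 'score': ...} dicts, as in the pipeline output shown earlier:

# Equivalent fix: pick the "nothate" entry by label instead of by position.
reward_tensors = [
    torch.tensor(next(item["score"] for item in reward if item["label"] == "nothate"))
    for reward in rewards
]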

Cheers,
Elemento


Hi Elemento! It’s been a while and glad to see you here again. Thank you for resurfacing this issue. We escalated this some time ago but haven’t heard back. There have been some changes, and we might be able to push fixes faster now. We’ll prioritize this issue this coming work week. I realize this has come up in a few other threads. Thanks!


Hey Chris,
Likewise, I can see a lot of changes in the community. Looking forward to engaging with some new and exciting courses.

Cheers,
Elemento


Hello. Just an update. This is under review. Will try to update by end of the week.


Update: This is still not revised because there seems to be another problem. The “query” is also passed to the toxicity evaluator (i.e., as query-response pairs). That can obscure the score, because the original text can pull it down even if the summary itself is not toxic. We’re looking into this. Thanks.
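
For example, one possible fix would be to score only the generated response rather than the concatenated query-response string. A rough sketch (the variable names are hypothetical and depend on how the loop is structured):

# Score only the decoded summaries, not the query + response concatenation,
# so the toxicity of the original dialogue cannot drag down the reward.
texts_to_score = batch["response"]   # hypothetical: list of generated summaries
rewards = sentiment_pipe(texts_to_score, top_k=None, function_to_apply="none")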
