Mistake in Lab 3 training loop

To get the reward, the code takes the logit of the “nothate” class, looked up with a fixed not_hate_index. For instance, in the training loop:

reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]

However, the index of the “nothate” class in the sentiment_pipe output is not fixed; the ordering depends on the predicted class. If the text is classified as non-toxic, “nothate” is at index 0, but if the predicted class is toxic, “nothate” is at index 1.

Reward model output:
For non-toxic text:
[{'label': 'nothate', 'score': 3.114100694656372}, {'label': 'hate', 'score': -2.4896175861358643}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.003670616541057825}]
For toxic text:
[{'label': 'hate', 'score': 2.815366268157959}, {'label': 'nothate', 'score': -3.1568620204925537}]
[{'label': 'hate', 'score': 0.9974579215049744}, {'label': 'nothate', 'score': 0.0025420780293643475}]

Thus, when the output of the LLM is already toxic, the fixed index picks up the “hate” score instead, so the PPO algorithm rewards the model for toxic output and trains it to become even more toxic.
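
One possible fix, as a rough sketch, is to select the score by label rather than by position (assuming each entry in rewards is a list of label/score dicts like the outputs above):

# Pick the "nothate" score by its label, so the reward is correct
# regardless of which class the pipeline ranks first.
reward_tensors = [
    torch.tensor(next(item["score"] for item in reward if item["label"] == "nothate"))
    for reward in rewards
]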

Hi Christian, and welcome to the community! Thank you for the feedback. We’ll investigate and update the notebook if needed.