Difficulty understanding RoBERTa reward model behavior

For Lab 3, when we load the RoBERTa hate speech model as the reward model, I cannot make sense of the predicted logit values.
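
For reference, the toxicity_model and toxicity_tokenizer used in the snippets below are loaded roughly like this (a sketch from my notebook; I am assuming the lab's facebook/roberta-hate-speech-dynabench-r4-target checkpoint and the standard Auto classes):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint: the RoBERTa hate speech classifier used as the reward model
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)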

For instance, if we look at the following prompt and reward pair examples:

Example 1

non_toxic_text = "I do not hate you"

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

The output for the above is:

logits [not hate, hate]: [4.619934558868408, -4.1956915855407715]
probabilities [not hate, hate]: [0.9998515844345093, 0.0001483739906689152]
reward (high): [4.619934558868408]

Example 2

toxic_text = "I hate you"

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

Output:

logits [not hate, hate]: [4.708434104919434, -4.150185585021973]
probabilities [not hate, hate]: [0.9998579025268555, 0.0001421309425495565]
reward (high): [4.708434104919434]

“I hate you” gets a higher reward than the prompt “I do not hate you”.
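
To make the comparison easier to read, the two sentences can also be scored in one batch and compared on the probability of the 'hate' class rather than the raw 'not hate' logits (a sketch reusing toxicity_model and toxicity_tokenizer from above; padding=True is assumed so the two inputs can be batched):

import torch

texts = ["I do not hate you", "I hate you"]
inputs = toxicity_tokenizer(texts, return_tensors="pt", padding=True)

with torch.no_grad():
    # Pass input_ids and attention_mask together so padding is ignored
    logits = toxicity_model(**inputs).logits

# Probability assigned to the "hate" class (index 1) for each sentence
hate_probs = logits.softmax(dim=-1)[:, 1].tolist()
for text, p in zip(texts, hate_probs):
    print(f'{text!r}: p(hate) = {p:.6f}')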

Example 3:
(I tried a more toxic example here, as the previous scores were not making sense to me.)

toxic_text = "I hate you and want to kill you Roberta."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

Output:

logits [not hate, hate]: [4.30594539642334, -3.6561310291290283]
probabilities [not hate, hate]: [0.9996516704559326, 0.00034830759977921844]
reward (high): [4.30594539642334]

In the above example, the logit value for ‘not hate’ is still higher than the one for ‘hate’.
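
(For reference, the class ordering can be double-checked from the model config; a minimal sketch, assuming the usual id2label attribute of sequence-classification models:)

# Confirm which index corresponds to "not hate" vs. "hate"
print(toxicity_model.config.id2label)
# For this checkpoint the mapping should look something like {0: 'nothate', 1: 'hate'}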

It would be helpful if I could get some perspective on the model’s behavior.

Hello @Sabaina_Haroon, this model is pretty small compared to larger ones like ChatGPT, so its accuracy may not be great; it also may not have been fine-tuned on a large dataset.
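
For what it's worth, the model's scale is easy to inspect directly (a minimal sketch, reusing the toxicity_model loaded in the question):

# Parameter count - the classifier is on the scale of a RoBERTa-base encoder,
# orders of magnitude smaller than the large chat models
num_params = sum(p.numel() for p in toxicity_model.parameters())
print(f'parameters: {num_params / 1e6:.1f}M')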


I am getting the same behavior, and it confused the heck out of me. I threw some vulgarity into the toxic text example, and the model still gives a 99%+ 'not hate' prediction.
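
One cross-check that might be useful is the evaluate library's toxicity measurement, which (as far as I understand) wraps the same facebook/roberta-hate-speech-dynabench-r4-target checkpoint and reports the probability of the 'hate' class directly (a sketch; assumes the evaluate package is installed):

import evaluate

# The toxicity measurement returns p(hate) for each input text
toxicity = evaluate.load("toxicity", module_type="measurement")
results = toxicity.compute(predictions=["I do not hate you", "I hate you"])
print(results["toxicity"])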