Difficulty understanding RoBERTa reward model behavior

For Lab 3, when we load the RoBERTa hate speech model as the reward model, I cannot make sense of the predicted logit values.
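
For reference, the toxicity_model and toxicity_tokenizer used in the snippets below are loaded roughly like this (a sketch from my notebook; I am assuming the lab's facebook/roberta-hate-speech-dynabench-r4-target checkpoint and the standard Auto classes):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint: the RoBERTa hate speech classifier used as the reward model
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)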

For instance, if we look at the following prompt and reward pair examples:

Example 1

non_toxic_text = "I do not hate you"

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

The output for the above is:

logits [not hate, hate]: [4.619934558868408, -4.1956915855407715]
probabilities [not hate, hate]: [0.9998515844345093, 0.0001483739906689152]
reward (high): [4.619934558868408]

Example 2

toxic_text = "I hate you"

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

Output:

logits [not hate, hate]: [4.708434104919434, -4.150185585021973]
probabilities [not hate, hate]: [0.9998579025268555, 0.0001421309425495565]
reward (high): [4.708434104919434]

“I hate you” gets a higher reward than the prompt “I do not hate you”.
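
To make the comparison easier to read, the two sentences can also be scored in one batch and compared on the probability of the 'hate' class rather than the raw 'not hate' logits (a sketch reusing toxicity_model and toxicity_tokenizer from above; padding=True is assumed so the two inputs can be batched):

import torch

texts = ["I do not hate you", "I hate you"]
inputs = toxicity_tokenizer(texts, return_tensors="pt", padding=True)

with torch.no_grad():
    # Pass input_ids and attention_mask together so padding is ignored
    logits = toxicity_model(**inputs).logits

# Probability assigned to the "hate" class (index 1) for each sentence
hate_probs = logits.softmax(dim=-1)[:, 1].tolist()
for text, p in zip(texts, hate_probs):
    print(f'{text!r}: p(hate) = {p:.6f}')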

Example 3:
(I tried a more toxic example here, as the previous scores were not making sense to me.)

toxic_text = "I hate you and want to kill you Roberta."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

Output:

logits [not hate, hate]: [4.30594539642334, -3.6561310291290283]
probabilities [not hate, hate]: [0.9996516704559326, 0.00034830759977921844]
reward (high): [4.30594539642334]

In the above example, the logit value for ‘not hate’ is still higher than the one for ‘hate’.
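
(For reference, the class ordering can be double-checked from the model config; a minimal sketch, assuming the usual id2label attribute of sequence-classification models:)

# Confirm which index corresponds to "not hate" vs. "hate"
print(toxicity_model.config.id2label)
# For this checkpoint the mapping should look something like {0: 'nothate', 1: 'hate'}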

It would be helpful if I could get some perspective on the model’s behavior.

Hello @Sabaina_Haroon, this model is pretty small compared to larger ones like ChatGPT, so its accuracy may not be great; it also may not have been fine-tuned on a large dataset.
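
For what it's worth, the model's scale is easy to inspect directly (a minimal sketch, reusing the toxicity_model loaded in the question):

# Parameter count - the classifier is on the scale of a RoBERTa-base encoder,
# orders of magnitude smaller than the large chat models
num_params = sum(p.numel() for p in toxicity_model.parameters())
print(f'parameters: {num_params / 1e6:.1f}M')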


I am getting the same behavior, and it confused the heck out of me. I threw some vulgarity into the toxic text example, and the model still gives a 99%+ 'not hate' prediction.
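
One cross-check that might be useful is the evaluate library's toxicity measurement, which (as far as I understand) wraps the same facebook/roberta-hate-speech-dynabench-r4-target checkpoint and reports the probability of the 'hate' class directly (a sketch; assumes the evaluate package is installed):

import evaluate

# The toxicity measurement returns p(hate) for each input text
toxicity = evaluate.load("toxicity", module_type="measurement")
results = toxicity.compute(predictions=["I do not hate you", "I hate you"])
print(results["toxicity"])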