Difficulty understanding Roberta reward model behavior

For lab 3, when we load the RoBERTa hate-speech model as the reward model, I cannot make sense of the predicted logit values.

For instance, if we look at the following prompt and reward pairs:

Example 1

```python
# toxicity_tokenizer and toxicity_model are loaded earlier in the lab notebook
non_toxic_text = "I do not hate you"

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')
```

The output for the above is:

```
logits [not hate, hate]: [4.619934558868408, -4.1956915855407715]
probabilities [not hate, hate]: [0.9998515844345093, 0.0001483739906689152]
reward (high): [4.619934558868408]
```

Example 2

```python
non_toxic_text = "I hate you"

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')
```

Output

```
logits [not hate, hate]: [4.708434104919434, -4.150185585021973]
probabilities [not hate, hate]: [0.9998579025268555, 0.0001421309425495565]
reward (high): [4.708434104919434]
```

“I hate you” gets a higher reward than the prompt “I do not hate you”.

Example 3:
(I tried a more toxic example here because the previous scores were not making sense to me)

```python
non_toxic_text = "I hate you and want to kill you Roberta."

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')
```

Output:

```
logits [not hate, hate]: [4.30594539642334, -3.6561310291290283]
probabilities [not hate, hate]: [0.9996516704559326, 0.00034830759977921844]
reward (high): [4.30594539642334]
```

In the above example, the logit value for ‘not hate’ is still much higher than the one for ‘hate’.

It would be helpful if I could get some perspective on the model’s behavior.
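To make the odd ordering explicit, here are the three ‘not hate’ reward logits lined up side by side (values copied verbatim from the outputs above). The bluntly hostile sentence actually scores highest, and even the threat only lowers the reward slightly:

```python
# "not hate" logits copied from the three example outputs above
rewards = {
    "I do not hate you": 4.619934558868408,
    "I hate you": 4.708434104919434,
    "I hate you and want to kill you Roberta.": 4.30594539642334,
}

# Sort highest reward first
for text, reward in sorted(rewards.items(), key=lambda kv: -kv[1]):
    print(f"{reward:.3f}  {text!r}")
```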

Hello @Sabaina_Haroon, this model is pretty small compared to larger ones like ChatGPT, so its accuracy may not be great; also, if it was fine-tuned, it may not have been fine-tuned on a large dataset.


I am getting the same behavior and this confused the heck out of me. I threw some vulgarity into the toxic text example and the model is still giving a 99%+ non-hate prediction.