Week 3 general question

Hi everyone, I have a general question regarding the first part of the Week 3 lectures. I would like to clarify the following:

When we want to align the model with human feedback, we use RLHF, which relies on a reward model. Once we have prepared the dataset for the reward model, which consists of pairs of completions, I am not sure I fully understand how we get to the point at which our reward model becomes a binary classifier. The lecture says we can use BERT trained with supervised learning on the pairwise comparison data.

Could you possibly clarify the last sentence? I am not sure I fully grasp the idea. Does it mean that we fine-tune BERT to handle pairwise data? If so, what are the inputs and what are the labels? And how do we end up with a binary classifier that we can then use as the reward model?

Overview of the Reward Model

  1. Purpose: The reward model is used to evaluate and score the quality of different model outputs (completions) based on human preferences. The goal is to align the model’s behavior with human expectations.
  2. Pairwise Comparisons: The dataset for training the reward model often consists of pairs of completions where human annotators have indicated a preference for one completion over the other. This preference data is crucial for the reward model to learn what humans value in a response.
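For concreteness, a single training record for the reward model might look like the sketch below. The field names and the prompt text are illustrative assumptions, not a format prescribed by the lecture.

```python
# Illustrative structure of one pairwise-preference record.
# Field names and the prompt are assumptions for illustration only.
preference_record = {
    "prompt": "What are the benefits of regular exercise?",
    "completion_a": ("Regular exercise improves cardiovascular health, "
                     "boosts mood, and increases energy levels."),
    "completion_b": "Exercise is good and can help you feel better.",
    "label": 1,  # 1 = annotators preferred completion_a, 0 = completion_b
}
```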

Using BERT as a Reward Model

  1. Fine-Tuning BERT: BERT, a pre-trained language model, can be fine-tuned to act as a reward model. The idea is to leverage BERT’s understanding of language to discern which completion in a pair is preferred based on human feedback.
  2. Input Format:
  • Pairwise Input: Each input to the model consists of a pair of completions, typically concatenated together with a special separator token (e.g., [SEP] in BERT).
  • Encoding: Each pair is encoded using BERT’s tokenizer, which prepares the input in a format suitable for BERT.
  3. Labels:
  • The labels for the pairwise data are binary, indicating which completion is preferred. For example, a label of 1 might indicate that the first completion is preferred, while a label of 0 might indicate that the second completion is preferred.
  4. Training as a Binary Classifier:
  • Objective: The training objective is to minimize the classification error on these pairwise comparisons. Essentially, BERT is fine-tuned to predict the binary label indicating the preferred completion.
  • Output: During training, the model learns to output a score or probability for each completion in the pair, which reflects the likelihood of that completion being preferred.
  5. Using the Model:
  • Once fine-tuned, the model can be used to score new completions. For a given completion, the model provides a score reflecting its alignment with human preferences.
  • These scores can then be used as rewards in RLHF to guide the main language model towards generating more preferred outputs; a minimal fine-tuning sketch follows below.
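To make the training setup concrete, here is a minimal sketch of how this pairwise fine-tuning could be done with the Hugging Face Transformers library. The checkpoint name, optimizer settings, and the toy one-example dataset are assumptions for illustration, not the exact recipe from the lecture.

```python
# Minimal sketch: fine-tune BERT as a binary classifier over completion pairs.
# Assumptions: bert-base-uncased, a toy in-memory dataset, plain PyTorch loop.
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Each example: (completion A, completion B, label); 1 = A preferred, 0 = B preferred.
pairs = [
    ("Regular exercise improves cardiovascular health, boosts mood, "
     "and increases energy levels.",
     "Exercise is good and can help you feel better.",
     1),
]

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()

for completion_a, completion_b, label in pairs:
    # Passing two text segments makes the tokenizer insert the [SEP] token between them.
    inputs = tokenizer(completion_a, completion_b,
                       return_tensors="pt", truncation=True, padding=True)
    labels = torch.tensor([label])

    outputs = model(**inputs, labels=labels)  # cross-entropy loss over the two classes
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key point is that the two completions become a single BERT input separated by [SEP], and the binary label simply records which of the two a human preferred.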

Does the above help with your questions?


Thanks for the detailed answer! I am still not sure how we get from stage 4 to stage 5 of Using BERT as a Reward Model. Namely, I do not understand how a model fine-tuned on pairwise input can then be used as a binary classifier. What I am trying to say is that when we want to use BERT to fine-tune an LLM, it should accept a prompt + its completion, while we fine-tuned it beforehand to accept pairwise input.


Here is how an example will be used in this case:

Input to the Reward Model

  1. Completion Pairs:
  • Each input to the model consists of a pair of completions that were generated in response to the same prompt.
  • These completions are concatenated into a single input sequence, often separated by a special token (e.g., [SEP] in BERT).
  2. Example Input:
  • For the following completion pair:
    • Completion A: “Regular exercise improves cardiovascular health, boosts mood, and increases energy levels.”
    • Completion B: “Exercise is good and can help you feel better.”
  • The input to the model might look like:

```
"Regular exercise improves cardiovascular health, boosts mood, and increases energy levels. [SEP] Exercise is good and can help you feel better."
```

Labels

  • The label for each pair indicates which completion is preferred. In this case, if Completion A is preferred, the pair is labeled 1 (first completion preferred); if Completion B were preferred, it would be labeled 0. A short scoring sketch follows below.
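To see this end to end, here is a minimal sketch of how the pair above could be scored with the fine-tuned classifier; the checkpoint path and variable names are placeholders, not something specified in the lecture.

```python
# Sketch: score the example completion pair with a fine-tuned pairwise classifier.
# "path/to/finetuned-reward-model" is a placeholder for the checkpoint produced
# by fine-tuning BERT on the preference pairs described above.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
reward_model = BertForSequenceClassification.from_pretrained("path/to/finetuned-reward-model")
reward_model.eval()

completion_a = ("Regular exercise improves cardiovascular health, boosts mood, "
                "and increases energy levels.")
completion_b = "Exercise is good and can help you feel better."

# Two text segments are encoded as "[CLS] A [SEP] B [SEP]".
inputs = tokenizer(completion_a, completion_b, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = reward_model(**inputs).logits   # shape: (1, 2)
    probs = torch.softmax(logits, dim=-1)

# Class 1 means "the first completion is preferred", so probs[0, 1] estimates
# P(Completion A preferred over Completion B), which can serve as a preference score.
print(f"P(Completion A preferred): {probs[0, 1].item():.3f}")
```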