Week 3 general question

Hi everyone, I have a general question regarding the first part of the Week 3 lectures. I would like to clarify the following:

When we want to align the model with human feedback, we use RLHF, which relies on a reward model. Once we have prepared the dataset for the reward model, which consists of pairs of completions, I am not sure I fully understand how we get to the point at which our reward model becomes a binary classifier. The lecture says we can use BERT trained with supervised learning on the pairwise comparison data.

Could you possibly clarify the last sentence? I am not sure I fully grasp the idea. Does it mean that we fine-tune BERT to handle pairwise data? If so, what are the inputs and what are the labels? And how do we end up with a binary classifier that we can then use as the reward model?

Overview of the Reward Model

  1. Purpose: The reward model is used to evaluate and score the quality of different model outputs (completions) based on human preferences. The goal is to align the model’s behavior with human expectations.
  2. Pairwise Comparisons: The dataset for training the reward model often consists of pairs of completions where human annotators have indicated a preference for one completion over the other. This preference data is crucial for the reward model to learn what humans value in a response.
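For concreteness, a single training record for the reward model might look like the sketch below. The field names and the prompt text are illustrative assumptions, not a format prescribed by the lecture.

```python
# Illustrative structure of one pairwise-preference record.
# Field names and the prompt are assumptions for illustration only.
preference_record = {
    "prompt": "What are the benefits of regular exercise?",
    "completion_a": ("Regular exercise improves cardiovascular health, "
                     "boosts mood, and increases energy levels."),
    "completion_b": "Exercise is good and can help you feel better.",
    "label": 1,  # 1 = annotators preferred completion_a, 0 = completion_b
}
```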

Using BERT as a Reward Model

  1. Fine-Tuning BERT: BERT, a pre-trained language model, can be fine-tuned to act as a reward model. The idea is to leverage BERT’s understanding of language to discern which completion in a pair is preferred based on human feedback.
  2. Input Format:
  • Pairwise Input: Each input to the model consists of a pair of completions, typically concatenated together with a special separator token (e.g., [SEP] in BERT).
  • Encoding: Each pair is encoded using BERT’s tokenizer, which prepares the input in a format suitable for BERT.
  3. Labels:
  • The labels for the pairwise data are binary, indicating which completion is preferred. For example, a label of 1 might indicate that the first completion is preferred, while a label of 0 might indicate that the second completion is preferred.
  4. Training as a Binary Classifier:
  • Objective: The training objective is to minimize the classification error on these pairwise comparisons. Essentially, BERT is fine-tuned to predict the binary label indicating the preferred completion.
  • Output: During training, the model learns to output a score or probability for each completion in the pair, which reflects the likelihood of that completion being preferred.
  5. Using the Model:
  • Once fine-tuned, the model can be used to score new completions. For a given completion, the model provides a score reflecting its alignment with human preferences.
  • These scores can then be used as rewards in RLHF to guide the main language model towards generating more preferred outputs; a minimal fine-tuning sketch follows below.
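To make the training setup concrete, here is a minimal sketch of how this pairwise fine-tuning could be done with the Hugging Face Transformers library. The checkpoint name, optimizer settings, and the toy one-example dataset are assumptions for illustration, not the exact recipe from the lecture.

```python
# Minimal sketch: fine-tune BERT as a binary classifier over completion pairs.
# Assumptions: bert-base-uncased, a toy in-memory dataset, plain PyTorch loop.
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Each example: (completion A, completion B, label); 1 = A preferred, 0 = B preferred.
pairs = [
    ("Regular exercise improves cardiovascular health, boosts mood, "
     "and increases energy levels.",
     "Exercise is good and can help you feel better.",
     1),
]

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()

for completion_a, completion_b, label in pairs:
    # Passing two text segments makes the tokenizer insert the [SEP] token between them.
    inputs = tokenizer(completion_a, completion_b,
                       return_tensors="pt", truncation=True, padding=True)
    labels = torch.tensor([label])

    outputs = model(**inputs, labels=labels)  # cross-entropy loss over the two classes
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key point is that the two completions become a single BERT input separated by [SEP], and the binary label simply records which of the two a human preferred.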

Does the above help with your questions?


Thanks for the detailed answer! I am still not sure how we get from stage 4 to stage 5 of Using BERT as a Reward Model. Namely, I do not understand how a model fine-tuned on pairwise input can then be used as a binary classifier. What I am trying to say is that when we want to use BERT to fine-tune an LLM, it should accept a prompt + its completion, while we fine-tuned it beforehand to accept pairwise input.


Here is how an example will be used in this case:

Input to the Reward Model

  1. Completion Pairs:
  • Each input to the model consists of a pair of completions that were generated in response to the same prompt.
  • These completions are concatenated into a single input sequence, often separated by a special token (e.g., [SEP] in BERT).
  2. Example Input:
  • For the following completion pair:
    • Completion A: “Regular exercise improves cardiovascular health, boosts mood, and increases energy levels.”
    • Completion B: “Exercise is good and can help you feel better.”
  • The input to the model might look like:

```
"Regular exercise improves cardiovascular health, boosts mood, and increases energy levels. [SEP] Exercise is good and can help you feel better."
```

Labels

  • The label for each pair indicates which completion is preferred. In this case, if Completion A is preferred, the pair is labeled 1 (first completion preferred); if Completion B were preferred, it would be labeled 0. A short scoring sketch follows below.
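To see this end to end, here is a minimal sketch of how the pair above could be scored with the fine-tuned classifier; the checkpoint path and variable names are placeholders, not something specified in the lecture.

```python
# Sketch: score the example completion pair with a fine-tuned pairwise classifier.
# "path/to/finetuned-reward-model" is a placeholder for the checkpoint produced
# by fine-tuning BERT on the preference pairs described above.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
reward_model = BertForSequenceClassification.from_pretrained("path/to/finetuned-reward-model")
reward_model.eval()

completion_a = ("Regular exercise improves cardiovascular health, boosts mood, "
                "and increases energy levels.")
completion_b = "Exercise is good and can help you feel better."

# Two text segments are encoded as "[CLS] A [SEP] B [SEP]".
inputs = tokenizer(completion_a, completion_b, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = reward_model(**inputs).logits   # shape: (1, 2)
    probs = torch.softmax(logits, dim=-1)

# Class 1 means "the first completion is preferred", so probs[0, 1] estimates
# P(Completion A preferred over Completion B), which can serve as a preference score.
print(f"P(Completion A preferred): {probs[0, 1].item():.3f}")
```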