Week 3 lab: the part where human feedback is given to the reward model


Hello, I am taking the Week 3 lab.
I am working on a lab that practices RLHF. Is there a part where human feedback is delivered to the reward model?

From what I heard in the lecture, I understood that human feedback is collected, passed on to the reward model, and then the reward model is trained on it. But I don't know where this part is in the lab.

Or does this exercise just use something called "scaling human feedback"?


The reward model (as far as I remember now) is already trained with human feedback outside the lab!


To complement the answer: yes, the reward model is a pre-trained model that was prepared using human feedback. You can actually choose any reward model suited to your needs when performing reinforcement learning, and you can even train your own reward model if you want to further customize your results.
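For context, here is a minimal sketch of what loading such a pre-trained reward model can look like. The checkpoint name is an assumption (a RoBERTa-based hate-speech classifier, like the reward model discussed later in this thread); any sequence-classification model suited to your task would work the same way:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; swap in any classifier suited to your task.
reward_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name)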


Hello. I have more questions, so I'm leaving a reply.
Can the reward model be any language model?

It has to be in the context of what you are training!

Understood. But what I mean is this:

RoBERTa is a language model trained on general text.
But when training a reward model, which base model should you use? Can any language model be a reward model?

Yes, but as far as I understand, any given model has its own training dataset, and that cannot be universal enough to apply to all language models. In other words, the reward model and the main LLM need to be trained for the same purpose, on the same kind of corpus, etc.
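To make that concrete, here is a rough sketch of how a reward model scores a (prompt, completion) pair produced by the main model. It reuses the reward_model and reward_tokenizer names from the sketch above; treating logit index 0 as the positive ("not hate") class is an assumption:

import torch

# Score one prompt + completion produced by the main model.
text = "Summarize the dialogue. ... Summary: The meeting was moved to Friday."
inputs = reward_tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = reward_model(**inputs).logits   # shape: [1, num_labels]

# Use one class logit as the scalar reward (index 0 assumed to be the positive class).
reward = logits[0, 0].item()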

Thanks!!


Sorry, I have another question.
Which models does the lab use?

From what I understand, it is as follows:
Reference Model (LoRA), PPO Model (LoRA), Reward Model

My head hurts; it's hard to understand. Please help me.

The base model is model_name="google/flan-t5-base",

then you load the PEFT model with PeftModel.from_pretrained and the LoRA configuration so it can be trained further. The PEFT model is a LoRA adapter that was previously trained on top of the base model above.
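Roughly, that loading step can look like the sketch below; the checkpoint path is a placeholder, not necessarily the exact one used in the lab:

import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM

# Load the base model, then attach the LoRA adapter trained in the previous lab.
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base",
                                                   torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(base_model,
                                       "./peft-dialogue-summary-checkpoint",  # placeholder path
                                       is_trainable=True)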

Then,
"In this lab, you are preparing to fine-tune the LLM using Reinforcement Learning (RL). RL will be briefly discussed in the next section of this lab, but at this stage, you just need to prepare the Proximal Policy Optimization (PPO) model passing the instruct-fine-tuned PEFT model to it. PPO will be used to optimize the RL policy against the reward model."

import torch
from trl import AutoModelForSeq2SeqLMWithValueHead

ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)
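The frozen reference model (used to measure how far the PPO policy drifts during training) can then be created as a copy of the PPO model, for example with trl's create_reference_model:

from trl import create_reference_model

# Frozen copy of the PPO model, used as the KL-divergence reference during RL.
ref_model = create_reference_model(ppo_model)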

So what I understand is as follows. If not, please let me know.

The base model is FLAN-T5.

And we download the PEFT (LoRA) adapter trained in the previous lab from S3.

Then we inject the LoRA config into the base model and check the trainable parameters. And we create peft_model by combining the base model's parameters with the LoRA adapter downloaded from S3.

And we make this peft_model into ppo_model.
Then we create ref_model based on ppo_model.

Did I understand this correctly?

And we use the RoBERTa model as the reward model, and we did not train it separately. It is a reward model that has already been trained externally.

Did I understand this correctly?

Yes, the reward model is trained separately; it is another model on the side checking our main model's outputs.

"And we make this peft_model into ppo_model." No!

Two models are running here: the main model and the reward model! I think you should go back through the lectures one more time; it has been a long time since I did this as well!
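To illustrate the separation, here is a rough sketch of one PPO iteration in the style of older trl versions; ppo_trainer, tokenizer, and score_with_reward_model are assumed or hypothetical names, not the lab's exact code:

import torch

# One simplified PPO iteration: the main (policy) model generates responses,
# the separate reward model scores them, and PPO updates only the policy.
for batch in ppo_trainer.dataloader:                      # assumes a configured trl PPOTrainer
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=100)
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]
    rewards = [torch.tensor(score_with_reward_model(t)) for t in texts]  # hypothetical helper
    ppo_trainer.step(query_tensors, response_tensors, rewards)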

But look, it seems like the peft_model is used as the ppo_model.

No, the peft_model is an input to the ppo_model!

Even after re-watching the lecture and re-reading the code, I still don’t understand what you’re saying.

In the picture I uploaded above, the PEFT adapter is used in a variable called ppo_model. But I don't understand why you say no.

I have corrected my statement above. If you are still going through the lectures and don't understand what's going on, then, to be honest with you, I am not better than the teaching instructors; maybe this is way over your head.

Oh yeah, I understand now. I'm Korean and don't know English well, so I use a translator, and there was a misunderstanding here.

Can I ask another question?

The PPO algorithm uses clipping to keep the policy from changing excessively.

And KL divergence also plays a role related to the policy.

Why should we use KL divergence when PPO already uses clipping?

KL divergence is a quantity calculated with a formula, while clipping is just a cut-off; they are not of the same nature. It is like a normal loss plus a regularization term, and why not use them both if they are effective!
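As a rough illustration of the difference, here is a generic sketch of the math on toy numbers; the coefficients and values are made up, and this is not the lab's code:

import torch

# Toy per-token log-probabilities for the same tokens under three models.
log_probs_new = torch.tensor([-1.2, -0.7])   # updated policy
log_probs_old = torch.tensor([-1.0, -0.9])   # policy before the PPO update
log_probs_ref = torch.tensor([-1.1, -0.8])   # frozen reference (instruct-tuned) model
advantages = torch.tensor([0.5, -0.3])

# Clipping: bound the policy ratio inside [1 - eps, 1 + eps] in the surrogate loss,
# which limits how much a single PPO update can move the policy away from the old one.
eps = 0.2
ratio = torch.exp(log_probs_new - log_probs_old)
surrogate_loss = -torch.min(ratio * advantages,
                            torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()

# KL penalty: estimate the divergence from the frozen reference model and subtract it
# from the reward, which keeps the whole RL run anchored to the original model.
kl_estimate = log_probs_new - log_probs_ref
raw_rewards = torch.tensor([1.0, 1.0])
shaped_rewards = raw_rewards - 0.1 * kl_estimate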