Week 3 lab: the part where human feedback is given to the reward model


Hello, I am taking the Week 3 lab.
I am working on a lab that practices RLHF. Is there a part where human feedback is delivered to the reward model?

From what I heard in the lecture, I understood that human feedback is collected, passed on to the reward model, and then the reward model is trained on it. But I don't know where this part is in the lab.

Or does this exercise just use something called "scaling human feedback"?


The reward model (as far as I remember now) is already trained with human feedback outside the lab!


To complement the answer: yes, the reward model is a pre-trained model that was prepared using human feedback. You can actually choose any reward model suited to your needs when performing reinforcement learning, and you can even train your own reward model if you want to further customize your results.
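For context, here is a minimal sketch of what loading such a pre-trained reward model can look like. The checkpoint name is an assumption (a RoBERTa-based hate-speech classifier, like the reward model discussed later in this thread); any sequence-classification model suited to your task would work the same way:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; swap in any classifier suited to your task.
reward_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name)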


Hello. I have more questions, so I'm leaving a reply.
Can the reward model be any language model?

It has to be in the context of what you are training!

Understood. But what I mean is this:

RoBERTa is a language model trained on general text.
But when training a reward model, which base model should you use? Can any language model be a reward model?

Yes, but as far as I understand, any given model has its own training dataset, and that cannot be universal enough to apply to all language models. In other words, the reward model and the main LLM need to be trained for the same purpose, on the same kind of corpus, etc.
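To make that concrete, here is a rough sketch of how a reward model scores a (prompt, completion) pair produced by the main model. It reuses the reward_model and reward_tokenizer names from the sketch above; treating logit index 0 as the positive ("not hate") class is an assumption:

import torch

# Score one prompt + completion produced by the main model.
text = "Summarize the dialogue. ... Summary: The meeting was moved to Friday."
inputs = reward_tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = reward_model(**inputs).logits   # shape: [1, num_labels]

# Use one class logit as the scalar reward (index 0 assumed to be the positive class).
reward = logits[0, 0].item()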

Thanks!!


Sorry, I have another question.
Which models does the lab use?

From what I understand, it is as follows:
Reference Model (LoRA), PPO Model (LoRA), Reward Model

My head hurts; it's hard to understand. Please help me.

The base model is model_name="google/flan-t5-base",

then you load the PEFT model with PeftModel.from_pretrained and the LoRA configuration so it can be trained further. The PEFT model is a LoRA adapter that was previously trained on top of the base model above.
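Roughly, that loading step can look like the sketch below; the checkpoint path is a placeholder, not necessarily the exact one used in the lab:

import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM

# Load the base model, then attach the LoRA adapter trained in the previous lab.
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base",
                                                   torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(base_model,
                                       "./peft-dialogue-summary-checkpoint",  # placeholder path
                                       is_trainable=True)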

Then,
"In this lab, you are preparing to fine-tune the LLM using Reinforcement Learning (RL). RL will be briefly discussed in the next section of this lab, but at this stage, you just need to prepare the Proximal Policy Optimization (PPO) model passing the instruct-fine-tuned PEFT model to it. PPO will be used to optimize the RL policy against the reward model."

import torch
from trl import AutoModelForSeq2SeqLMWithValueHead

ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)
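The frozen reference model (used to measure how far the PPO policy drifts during training) can then be created as a copy of the PPO model, for example with trl's create_reference_model:

from trl import create_reference_model

# Frozen copy of the PPO model, used as the KL-divergence reference during RL.
ref_model = create_reference_model(ppo_model)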

So what I understand is as follows. If not, please let me know.

The base model is FLAN-T5.

And we download the PEFT (LoRA) adapter trained in the previous lab from S3.

Then we inject the LoRA config into the base model and check the trainable parameters. And we create peft_model by combining the base model's parameters with the LoRA adapter downloaded from S3.

And we make this peft_model into ppo_model.
Then we create ref_model based on ppo_model.

Did I understand this correctly?

And we use the RoBERTa model as the reward model, and we did not train it separately. It is a reward model that has already been trained externally.

Did I understand this correctly?

Yes, the reward model is trained separately; it is another model on the side checking our main model's outputs.

"And we make this peft_model into ppo_model." No!

Two models are running here: the main model and the reward model! I think you should go back through the lectures one more time; it has been a long time since I did this as well!
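To illustrate the separation, here is a rough sketch of one PPO iteration in the style of older trl versions; ppo_trainer, tokenizer, and score_with_reward_model are assumed or hypothetical names, not the lab's exact code:

import torch

# One simplified PPO iteration: the main (policy) model generates responses,
# the separate reward model scores them, and PPO updates only the policy.
for batch in ppo_trainer.dataloader:                      # assumes a configured trl PPOTrainer
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=100)
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]
    rewards = [torch.tensor(score_with_reward_model(t)) for t in texts]  # hypothetical helper
    ppo_trainer.step(query_tensors, response_tensors, rewards)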

But look, it seems like the peft_model is used as the ppo_model.

No, the peft_model is an input to the ppo_model!

Even after re-watching the lecture and re-reading the code, I still don’t understand what you’re saying.

In the picture I uploaded above, the PEFT adapter is used in a variable called ppo_model. But I don't understand why you say no.

I have corrected my statement above. If you are still going through the lectures and don't understand what's going on, then, to be honest with you, I am not better than the teaching instructors; maybe this is way over your head.

Oh yeah, I understand now. I'm Korean and don't know English well, so I use a translator, and there was a misunderstanding here.

Can I ask another question?

The PPO algorithm uses clipping to keep the policy from changing excessively.

And KL divergence also plays a role related to the policy.

Why should we use KL divergence when PPO already uses clipping?

KL divergence is a quantity calculated with a formula, while clipping is just a cut-off; they are not of the same nature. It is like a normal loss plus a regularization term, and why not use them both if they are effective!
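As a rough illustration of the difference, here is a generic sketch of the math on toy numbers; the coefficients and values are made up, and this is not the lab's code:

import torch

# Toy per-token log-probabilities for the same tokens under three models.
log_probs_new = torch.tensor([-1.2, -0.7])   # updated policy
log_probs_old = torch.tensor([-1.0, -0.9])   # policy before the PPO update
log_probs_ref = torch.tensor([-1.1, -0.8])   # frozen reference (instruct-tuned) model
advantages = torch.tensor([0.5, -0.3])

# Clipping: bound the policy ratio inside [1 - eps, 1 + eps] in the surrogate loss,
# which limits how much a single PPO update can move the policy away from the old one.
eps = 0.2
ratio = torch.exp(log_probs_new - log_probs_old)
surrogate_loss = -torch.min(ratio * advantages,
                            torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()

# KL penalty: estimate the divergence from the frozen reference model and subtract it
# from the reward, which keeps the whole RL run anchored to the original model.
kl_estimate = log_probs_new - log_probs_ref
raw_rewards = torch.tensor([1.0, 1.0])
shaped_rewards = raw_rewards - 0.1 * kl_estimate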