Hello, I am taking the Week 3 lab.
I am working on a lab that practices RLHF. Is there a part where human feedback is delivered to the reward model?
From what I heard in the lecture, I understood that human feedback is collected, passed on to the reward model, and then the reward model is trained on it. But I don't know where this part is in the lab.
Or does this exercise just use something called scaling human feedback?
To complement the answer: yes, the reward model is a pre-trained model that was prepared using human feedback. You can actually choose any reward model suited to your needs when performing reinforcement learning, and you can even train your own reward model if you want to further customize your results.
In this lab the reward model is based on RoBERTa, a pre-trained language model that has already been fine-tuned as a classifier on human-labelled data; its output scores are used as the reward signal.
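As a minimal sketch (not the lab's exact code), any sequence-classification model can play the reward-model role; the model name, label handling, and example text below are assumptions for illustration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed example: a RoBERTa-based classifier already fine-tuned on human-labelled data.
reward_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name)

text = "The summary was polite and helpful."
inputs = reward_tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = reward_model(**inputs).logits          # shape: [1, num_labels]

# Use the logit of the preferred label as the scalar reward for RL.
not_hate_id = reward_model.config.label2id.get("nothate", 0)
reward = logits[0, not_hate_id].item()
print(reward)
```

The scalar reward obtained this way is what the PPO step later tries to maximize.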
But when training a reward model, which base reward model should you use? Can all language models be reward models?
Yes, but as far as I understand, any model, whatever it is, has its own training dataset, and that cannot be universal enough to apply to all language models; i.e. the reward model and the main LLM need to be trained for the same purpose, on the same kind of corpus, etc.
The base model is model_name="google/flan-t5-base",
then you load the PEFT model with peft_model = PeftModel.from_pretrained(...) and the LoRA configuration so it can be trained further. The PEFT model is an adapter that extends the base model and that was previously trained on top of the main model above.
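As a rough sketch of that loading step (the local adapter path and keyword arguments below are assumptions, check the notebook for the exact values):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Attach the LoRA adapter trained in the previous lab (downloaded from S3) to the base model.
peft_model = PeftModel.from_pretrained(
    base_model,
    "./peft-checkpoint-from-s3/",   # assumed local path to the downloaded adapter
    torch_dtype=torch.bfloat16,
    is_trainable=True,              # keep the adapter weights trainable for the PPO phase
)

# Only the small LoRA adapter parameters are trainable; the base model stays frozen.
peft_model.print_trainable_parameters()
```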
Then,
"In this lab, you are preparing to fine-tune the LLM using Reinforcement Learning (RL). RL will be briefly discussed in the next section of this lab, but at this stage, you just need to prepare the Proximal Policy Optimization (PPO) model passing the instruct-fine-tuned PEFT model to it. PPO will be used to optimize the RL policy against the reward model."
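In code, that preparation step looks roughly like the following sketch using the trl library (argument values are assumptions, and peft_model is the model loaded in the previous sketch):

```python
import torch
from trl import AutoModelForSeq2SeqLMWithValueHead, create_reference_model

# Wrap the instruct-fine-tuned PEFT model with a value head so PPO can optimize it.
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(
    peft_model,                 # the PEFT model prepared in the previous sketch
    torch_dtype=torch.bfloat16,
    is_trainable=True,
)

# A frozen copy of the policy; PPO measures KL divergence against it so the
# updated model does not drift too far from the original behaviour.
ref_model = create_reference_model(ppo_model)
```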
So my understanding is as follows. If anything is wrong, please let me know.
The base model is FLAN-T5.
Then we download the PEFT (LoRA) adapter trained in the previous lab from S3.
Then we apply the LoRA config to the base model to check the trainable parameters, and we create peft_model by combining the base model with the LoRA adapter downloaded from S3.
Then we turn this peft_model into ppo_model.
Then we create ref_model based on ppo_model.
Did I understand this correctly?
And we use the RoBERTa model as the reward model without training it separately; it is a reward model that has already been trained externally.
Yes, the reward model is trained separately and is another model on the side, checking the outputs of our main model.
"And we make this peft_model into ppo_model." No! Two models are running here, the main model and the reward model! I think you should go through the classes one more time; it's been a long time since I have done this as well!
I have corrected my statement above. If you are still going through the lectures and don't understand what's going on, then to be honest with you, I am not better than the teaching instructors; maybe this is way over your head.
KL divergence is a quantity calculated with a formula, while clipping is just a cut-off; they are not of the same nature. It is like a normal loss plus regularization, and why not use them both in the first place if they are effective!
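To make the difference concrete, here is a toy calculation with made-up numbers showing how the PPO clipped surrogate term and a per-token KL penalty term can be combined in one objective:

```python
import torch

advantage = torch.tensor(2.0)          # made-up advantage estimate
ratio = torch.tensor(1.4)              # pi_new(a|s) / pi_old(a|s), made up
epsilon = 0.2                          # PPO clipping range

# PPO clipped surrogate: the clip() simply cuts the ratio off at 1 +/- epsilon.
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
surrogate = torch.min(unclipped, clipped)

# KL penalty: a quantity computed from the policy and reference log-probabilities.
logprob_policy = torch.tensor(-1.1)    # made up
logprob_ref = torch.tensor(-1.5)       # made up
kl = logprob_policy - logprob_ref      # simple per-token KL estimate
beta = 0.1                             # KL coefficient

objective = surrogate - beta * kl      # both mechanisms combined in one objective
print(surrogate.item(), kl.item(), objective.item())
```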