Hi~
I didn’t understand a lot of what was said in the lecture, so I would really appreciate it if you could help me.
Q1)
What are the corresponding elements in the figure below for Fine-tuning with RLHF?
action - ? , state - ?, environment - ?
(my guess → action - the output generated by the instruct LLM, state - ?, environment - the reward model)
Q2)
When training the reward model, why do we minimize a loss that is computed from the difference between r_j and r_k?
Shouldn’t the difference between the higher-ranked r_j and the lower-ranked r_k be large, so that the model is more likely to produce completions similar to the one that received r_j?
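To make sure I’m looking at the right formula, here is the pairwise loss I think the lecture is referring to, written as a minimal PyTorch sketch (I’m assuming the standard -log(sigmoid(r_j - r_k)) form; the function name and toy numbers are just my own placeholders):

```python
import torch
import torch.nn.functional as F

def reward_pair_loss(r_j: torch.Tensor, r_k: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log(sigmoid(r_j - r_k)), averaged over the batch."""
    # r_j: reward scores for the higher-ranked (preferred) completions
    # r_k: reward scores for the lower-ranked (rejected) completions
    return -F.logsigmoid(r_j - r_k).mean()

# Toy scores for two comparison pairs, just to show the call
r_j = torch.tensor([1.2, 0.3])
r_k = torch.tensor([0.4, -0.1])
print(reward_pair_loss(r_j, r_k))
```

If this is the right formula, my question is whether minimizing it really does push the gap r_j - r_k to become large.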
Q3)
What is the role of each term in the equation?
My understanding is that the policy loss is for stable learning, the value loss is for learning human preferences, and the entropy loss is for keeping the model’s completions creative, but I want to be sure.
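In case it helps to see what I mean by each term, this is roughly how I picture the three terms being combined into one loss (a toy PyTorch sketch on my part, using the standard clipped PPO surrogate and placeholder coefficients c1, c2, and clip_eps, not necessarily the exact equation from the slides):

```python
import torch

def ppo_total_loss(ratio, advantage, value_pred, value_target, entropy,
                   clip_eps=0.2, c1=0.5, c2=0.01):
    """Combine the policy, value, and entropy terms into one loss to minimize."""
    # Policy loss: clipped surrogate that keeps the update close to the old policy.
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value loss: the value head learns to predict the return built from the reward model.
    value_loss = (value_pred - value_target).pow(2).mean()

    # Entropy term: subtracted, so higher entropy (more varied completions) lowers the loss.
    return policy_loss + c1 * value_loss - c2 * entropy.mean()

# Toy numbers just to show the call
ratio = torch.tensor([1.05, 0.90])
advantage = torch.tensor([0.3, -0.2])
value_pred = torch.tensor([0.8, 0.1])
value_target = torch.tensor([1.0, 0.0])
entropy = torch.tensor([2.1, 2.3])
print(ppo_total_loss(ratio, advantage, value_pred, value_target, entropy))
```

My question is whether my one-line descriptions of those three roles are actually correct.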
Q4)
I noticed that both the updated LLM and the initial LLM are used in the calculation of the policy loss. Does this mean that two models are used during fine-tuning with RLHF?
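Concretely, my mental picture is the sketch below: a frozen copy of the initial LLM is kept next to the LLM being updated, and the difference of their log-probabilities enters the objective as a KL-style penalty (the function and variable names are mine, and beta is a placeholder coefficient; whether this penalty enters through the reward or directly in the policy loss is part of what I’m unsure about):

```python
import torch

def kl_penalized_reward(reward, logprob_updated, logprob_initial, beta=0.1):
    """Reward signal that subtracts a per-sample KL estimate from the reward-model score."""
    # logprob_updated: log-probability of the completion under the LLM being trained
    # logprob_initial: log-probability of the same completion under the frozen initial LLM
    kl_estimate = logprob_updated - logprob_initial
    return reward - beta * kl_estimate

# Toy example: the same completion scored by both models
reward = torch.tensor([1.5])
logprob_updated = torch.tensor([-12.0])   # updated (trained) LLM
logprob_initial = torch.tensor([-14.0])   # frozen initial LLM
print(kl_penalized_reward(reward, logprob_updated, logprob_initial))
```

So my question is whether keeping that second, frozen model around is what “two models” means here.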
Q5)
When training a constitutional LLM, the dataset consists of Constitutional responses paired with the Original red-team prompts, the reward model is used to assign high scores to the Constitutional responses, and then the model is fine-tuned using the PPO algorithm. Is that right?
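Just to make my question concrete, the pipeline I have in mind looks roughly like the sketch below (plain Python with placeholder functions, so every name and step here is my own assumption about the process, not the actual implementation):

```python
from typing import List, Tuple

def constitutional_response(prompt: str) -> str:
    """Placeholder: generate a response, then critique and revise it using the constitution."""
    return f"[revised, constitution-following answer to: {prompt}]"

def train_reward_model(dataset: List[Tuple[str, str]]):
    """Placeholder: train a reward model that scores Constitutional responses highly."""
    return lambda prompt, response: 1.0  # dummy scorer

def ppo_finetune(llm: str, reward_model, prompts: List[str]) -> str:
    """Placeholder: fine-tune the LLM with PPO, using the reward model as the reward signal."""
    return llm

red_team_prompts = ["original red-team prompt 1", "original red-team prompt 2"]

# Step 1: pair each Original red-team prompt with a Constitutional response.
dataset = [(p, constitutional_response(p)) for p in red_team_prompts]

# Step 2: the reward model learns to give high scores to the Constitutional responses.
reward_model = train_reward_model(dataset)

# Step 3: the LLM is then fine-tuned against that reward model with PPO.
finetuned_llm = ppo_finetune("initial-llm", reward_model, red_team_prompts)
```

Is this the right order of steps, or am I missing a stage?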