I have a few questions about the content of the lecture

I didn’t understand a lot of what was said in the lecture :sweat_smile:
I would really appreciate it if you could help me.

What are the corresponding elements in the figure below for Fine-tuning with RLHF?
action - ? , state - ?, environment - ?
(my guess → action: the instruct LLM generating a completion, state: ?, environment: the reward model)

Why do we minimize a loss based on the difference between r_j and r_k when training the reward model?
Shouldn’t the difference between the higher-ranked r_j and the lower-ranked r_k be large, so that the model is more likely to produce results similar to r_j?
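To make my question concrete, this is how I understood the pairwise loss from the lecture (a pure-Python sketch; the function name is my own, not from the slides):

```python
import math

def reward_pair_loss(r_j, r_k):
    # Pairwise ranking loss: -log(sigmoid(r_j - r_k)), where
    # r_j = reward for the human-preferred completion and
    # r_k = reward for the rejected one.
    # The loss decreases as the gap r_j - r_k increases.
    return -math.log(1.0 / (1.0 + math.exp(-(r_j - r_k))))

low = reward_pair_loss(2.0, 0.5)   # preferred completion scores higher
high = reward_pair_loss(0.5, 2.0)  # preferred completion scores lower
```

So if I coded it right, the loss is small when r_j is already above r_k and large otherwise — is that the point of the formula?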

What is the role of each term in the equation?
My understanding is that the policy loss is for stable learning, the value loss is for learning human preferences, and the entropy loss encourages creativity in the model’s completions, but I want to confirm this.
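For reference, here is how I picture the three terms combining into one objective (the coefficients c1 and c2 and the function name are my guesses, not values from the lecture):

```python
def ppo_total_loss(policy_loss, value_loss, entropy, c1=0.5, c2=0.01):
    # Combined PPO objective, written as a loss to minimize:
    #   policy loss + c1 * value loss - c2 * entropy.
    # Subtracting the entropy term rewards higher-entropy (more
    # diverse) outputs, which is why I associated it with creativity.
    return policy_loss + c1 * value_loss - c2 * entropy

total = ppo_total_loss(policy_loss=1.0, value_loss=2.0, entropy=3.0)
```

Is this the right shape for the full objective, or did I mix up the signs?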

I noticed that both the updated LLM and the initial LLM are used when calculating the policy loss. Does this mean that two models are used during fine-tuning with RLHF?

When training a constitutional LLM, the dataset consists of original red-team prompts paired with constitutional responses, the reward model is trained to assign high scores to the constitutional responses, and then the model is fine-tuned using the PPO algorithm — is that right?