Hi~
I didn’t understand a lot of what was said in the lecture, so I would really appreciate it if you could help me.
Q1)
What are the corresponding elements in the figure below for Fine-tuning with RLHF?
action - ? , state - ?, environment - ?
(my guess → action - the output generated by the instruct LLM, state - ?, environment - the reward model)
Q2)
When training the reward model, why do we minimize a loss that is computed from the difference between r_j and r_k?
Shouldn’t the difference between the higher-ranked r_j and the lower-ranked r_k be large, so that the model is more likely to produce completions similar to the one that received r_j?
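To make sure I’m looking at the right formula, here is the pairwise loss I think the lecture is referring to, written as a minimal PyTorch sketch (I’m assuming the standard -log(sigmoid(r_j - r_k)) form; the function name and toy numbers are just my own placeholders):

```python
import torch
import torch.nn.functional as F

def reward_pair_loss(r_j: torch.Tensor, r_k: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log(sigmoid(r_j - r_k)), averaged over the batch."""
    # r_j: reward scores for the higher-ranked (preferred) completions
    # r_k: reward scores for the lower-ranked (rejected) completions
    return -F.logsigmoid(r_j - r_k).mean()

# Toy scores for two comparison pairs, just to show the call
r_j = torch.tensor([1.2, 0.3])
r_k = torch.tensor([0.4, -0.1])
print(reward_pair_loss(r_j, r_k))
```

If this is the right formula, my question is whether minimizing it really does push the gap r_j - r_k to become large.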
Q3)
What is the role of each term in the equation?
My understanding is that the policy loss is for stable learning, the value loss is for learning human preferences, and the entropy loss is for keeping the model’s completions creative, but I want to be sure.
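In case it helps to see what I mean by each term, this is roughly how I picture the three terms being combined into one loss (a toy PyTorch sketch on my part, using the standard clipped PPO surrogate and placeholder coefficients c1, c2, and clip_eps, not necessarily the exact equation from the slides):

```python
import torch

def ppo_total_loss(ratio, advantage, value_pred, value_target, entropy,
                   clip_eps=0.2, c1=0.5, c2=0.01):
    """Combine the policy, value, and entropy terms into one loss to minimize."""
    # Policy loss: clipped surrogate that keeps the update close to the old policy.
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value loss: the value head learns to predict the return built from the reward model.
    value_loss = (value_pred - value_target).pow(2).mean()

    # Entropy term: subtracted, so higher entropy (more varied completions) lowers the loss.
    return policy_loss + c1 * value_loss - c2 * entropy.mean()

# Toy numbers just to show the call
ratio = torch.tensor([1.05, 0.90])
advantage = torch.tensor([0.3, -0.2])
value_pred = torch.tensor([0.8, 0.1])
value_target = torch.tensor([1.0, 0.0])
entropy = torch.tensor([2.1, 2.3])
print(ppo_total_loss(ratio, advantage, value_pred, value_target, entropy))
```

My question is whether my one-line descriptions of those three roles are actually correct.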
Q4)
I noticed that both the updated LLM and the initial LLM are used in the calculation of the policy loss. Does this mean that two models are used during fine-tuning with RLHF?
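Concretely, my mental picture is the sketch below: a frozen copy of the initial LLM is kept next to the LLM being updated, and the difference of their log-probabilities enters the objective as a KL-style penalty (the function and variable names are mine, and beta is a placeholder coefficient; whether this penalty enters through the reward or directly in the policy loss is part of what I’m unsure about):

```python
import torch

def kl_penalized_reward(reward, logprob_updated, logprob_initial, beta=0.1):
    """Reward signal that subtracts a per-sample KL estimate from the reward-model score."""
    # logprob_updated: log-probability of the completion under the LLM being trained
    # logprob_initial: log-probability of the same completion under the frozen initial LLM
    kl_estimate = logprob_updated - logprob_initial
    return reward - beta * kl_estimate

# Toy example: the same completion scored by both models
reward = torch.tensor([1.5])
logprob_updated = torch.tensor([-12.0])   # updated (trained) LLM
logprob_initial = torch.tensor([-14.0])   # frozen initial LLM
print(kl_penalized_reward(reward, logprob_updated, logprob_initial))
```

So my question is whether keeping that second, frozen model around is what “two models” means here.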
Q5)
When training a constitutional LLM, the dataset consists of Constitutional responses paired with the Original red-team prompts, the reward model is used to assign high scores to the Constitutional responses, and then the model is fine-tuned using the PPO algorithm. Is that right?
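Just to make my question concrete, the pipeline I have in mind looks roughly like the sketch below (plain Python with placeholder functions, so every name and step here is my own assumption about the process, not the actual implementation):

```python
from typing import List, Tuple

def constitutional_response(prompt: str) -> str:
    """Placeholder: generate a response, then critique and revise it using the constitution."""
    return f"[revised, constitution-following answer to: {prompt}]"

def train_reward_model(dataset: List[Tuple[str, str]]):
    """Placeholder: train a reward model that scores Constitutional responses highly."""
    return lambda prompt, response: 1.0  # dummy scorer

def ppo_finetune(llm: str, reward_model, prompts: List[str]) -> str:
    """Placeholder: fine-tune the LLM with PPO, using the reward model as the reward signal."""
    return llm

red_team_prompts = ["original red-team prompt 1", "original red-team prompt 2"]

# Step 1: pair each Original red-team prompt with a Constitutional response.
dataset = [(p, constitutional_response(p)) for p in red_team_prompts]

# Step 2: the reward model learns to give high scores to the Constitutional responses.
reward_model = train_reward_model(dataset)

# Step 3: the LLM is then fine-tuned against that reward model with PPO.
finetuned_llm = ppo_finetune("initial-llm", reward_model, red_team_prompts)
```

Is this the right order of steps, or am I missing a stage?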