KL divergence or trust region?

  • Classroom item: Reading: KL Divergence

Hi, in the reading material for KL Divergence in Week 3,
I see this statement (please see the underlined part in the image below):

[image]

Earlier, in the optional video for PPO, it was mentioned that there is a concept of a “trust region” that constrains the updated LLM from straying too far from the initial LLM. But now KL divergence is said to be used for that? Which one is it?

In this image, and as the course instructs, the KL divergence is calculated between the frozen-weights copy of the original LLM and the fine-tuned (probably PEFT fine-tuned) copy of the LLM!

The KL divergence is calculated between these two to measure how much the fine-tuned model has drifted away from the original model while also trying to satisfy the PPO objective.
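As a rough sketch (not the course's actual code), the per-token KL penalty between the updated model and the frozen copy could be computed like this, assuming both models expose next-token logits; all names here are my own:

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(updated_logits, frozen_logits):
    """KL(updated || frozen) for one next-token distribution:
    how far the RLHF-updated model has drifted from the frozen
    original copy at this position."""
    p = softmax(updated_logits)
    q = softmax(frozen_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

If the two models agree exactly, the KL is zero; the more the fine-tuned model's distribution drifts, the larger the penalty that gets folded into the PPO reward.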


Thank you for the reply, and I understand that.

But my question is: if KL divergence ensures the updated LLM doesn't stray far from the original LLM, what is the need for the trust region in the policy loss function?

I had to watch the video carefully: that trust region refers to the inner workings of the PPO policy update itself, so that it doesn't move in large jumps and lose its path toward the better goal, i.e. the human-alignment policy.
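For concreteness, here is a minimal sketch of PPO's clipped surrogate objective, which is the mechanism the video calls the trust region; `eps = 0.2` is a common default, and the names are my own, not from the course:

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective for a single action.
    The probability ratio between the new and old policy is clipped
    to [1 - eps, 1 + eps], so one update step cannot jump too far
    from the policy that generated the data."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1 - eps, min(ratio, 1 + eps))
    # Take the min so clipping only ever makes the objective
    # more conservative, never more optimistic.
    return min(ratio * advantage, clipped_ratio * advantage)
```

Note what is being compared here: the policy before and after this one PPO step, not the updated LLM versus the frozen original copy.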

The KL divergence is between the fine-tuned LLM and the original LLM! The course mentions that PPO may reach a good human-preference optimization of the LLM, but the output might not make any sense, at least grammatically, compared to the original LLM.

Thank you for the reply.

I understand the KL divergence is between the RLHF LLM and the original (i.e. fine-tuned) LLM.

But then what is the trust region between? Isn't it also between the RLHF LLM and the original LLM?

Here the trust region refers to a trust region for movements in the PPO policy itself, not the LLM!

But don't small movements in the PPO policy => small updates to the original (fine-tuned/instruct) LLM weights?

So isn't it (conceptually) doing the same thing as the KL divergence penalty?

The KL divergence compares the old LLM model to the new fine-tuned LLM model!

The trust region is the movement allowed within the PPO policy update itself!

The ultimate goal is the same, but what's actually happening are two different things!
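To make the contrast concrete, here is a hypothetical one-sample sketch of how both terms might sit in a single RLHF training step; the log-ratio KL estimate and the coefficient `beta` are common implementation choices, not something specified by the course:

```python
import math

def rlhf_step_loss(logp_new, logp_old, logp_frozen, advantage,
                   eps=0.2, beta=0.1):
    """One-sample sketch combining both mechanisms.

    Trust region: clip the ratio against the policy from the
    previous PPO iteration (logp_old), which changes every step.
    KL penalty: compare against the frozen original LLM
    (logp_frozen), which never changes during training.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1 - eps, min(ratio, 1 + eps))
    policy_loss = -min(ratio * advantage, clipped * advantage)
    # Sampled log-ratio estimate of KL(new || frozen) at this token.
    kl_estimate = logp_new - logp_frozen
    return policy_loss + beta * kl_estimate
```

The two terms compare the current policy against two different baselines: one that moves every iteration, and one that is frozen for the whole training run.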

Why do we need it? Well, it seems the experts who have tested it found that they needed it! Is it complex? It is, but the whole RLHF/PPO fine-tuned LLM system is a complex system with a lot of tuning happening!
