KL divergence or trust region?

  • Classroom item: Reading: KL Divergence

Hi, in the reading material for KL Divergence in Week 3,
I see this statement (please see the underlined part in the image below):

[image]

Earlier, in the optional video for PPO, it was mentioned that there is a concept of a “trust region” that constrains the updated LLM from straying too far from the initial LLM. But now KL divergence is said to be used for that? Which one is it?

In this image, and as the course instructs, the KL divergence is calculated between the frozen-weights copy of the original LLM and the fine-tuned (probably PEFT fine-tuned) copy of the LLM!

The KL divergence is calculated between these two to measure how much the fine-tuned model has drifted away from the original model while also trying to satisfy the PPO objective.
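As a rough sketch (not the course's actual code), the per-token KL penalty between the updated model and the frozen copy could be computed like this, assuming both models expose next-token logits; all names here are my own:

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(updated_logits, frozen_logits):
    """KL(updated || frozen) for one next-token distribution:
    how far the RLHF-updated model has drifted from the frozen
    original copy at this position."""
    p = softmax(updated_logits)
    q = softmax(frozen_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

If the two models agree exactly, the KL is zero; the more the fine-tuned model's distribution drifts, the larger the penalty that gets folded into the PPO reward.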


Thank you for the reply, and I understand that.

But my question is: if KL divergence ensures the updated LLM doesn't stray far from the original LLM, what is the need for the trust region in the policy loss function?

I had to watch the video carefully: that trust region refers to the inner workings of the PPO policy update itself, so that it doesn't move in large jumps and lose its path toward the better goal, i.e. the human-alignment policy.
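For concreteness, here is a minimal sketch of PPO's clipped surrogate objective, which is the mechanism the video calls the trust region; `eps = 0.2` is a common default, and the names are my own, not from the course:

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective for a single action.
    The probability ratio between the new and old policy is clipped
    to [1 - eps, 1 + eps], so one update step cannot jump too far
    from the policy that generated the data."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1 - eps, min(ratio, 1 + eps))
    # Take the min so clipping only ever makes the objective
    # more conservative, never more optimistic.
    return min(ratio * advantage, clipped_ratio * advantage)
```

Note what is being compared here: the policy before and after this one PPO step, not the updated LLM versus the frozen original copy.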

The KL divergence is between the fine-tuned LLM and the original LLM! The course mentions that PPO may reach a good human-preference optimization of the LLM, but the output might not make any sense, at least grammatically, compared to the original LLM.

Thank you for the reply.

I understand the KL divergence is between the RLHF LLM and the original (i.e. fine-tuned) LLM.

But then what is the trust region between? Isn't it also between the RLHF LLM and the original LLM?

Here the trust region refers to a trust region for movements in the PPO policy itself, not the LLM!

But don't small movements in the PPO policy => small updates to the original (fine-tuned/instruct) LLM weights?

So isn't it (conceptually) doing the same thing as the KL divergence penalty?

The KL divergence compares the old LLM model to the new fine-tuned LLM model!

The trust region is the movement allowed within the PPO policy update itself!

The ultimate goal is the same, but what's actually happening are two different things!
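To make the contrast concrete, here is a hypothetical one-sample sketch of how both terms might sit in a single RLHF training step; the log-ratio KL estimate and the coefficient `beta` are common implementation choices, not something specified by the course:

```python
import math

def rlhf_step_loss(logp_new, logp_old, logp_frozen, advantage,
                   eps=0.2, beta=0.1):
    """One-sample sketch combining both mechanisms.

    Trust region: clip the ratio against the policy from the
    previous PPO iteration (logp_old), which changes every step.
    KL penalty: compare against the frozen original LLM
    (logp_frozen), which never changes during training.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1 - eps, min(ratio, 1 + eps))
    policy_loss = -min(ratio * advantage, clipped * advantage)
    # Sampled log-ratio estimate of KL(new || frozen) at this token.
    kl_estimate = logp_new - logp_frozen
    return policy_loss + beta * kl_estimate
```

The two terms compare the current policy against two different baselines: one that moves every iteration, and one that is frozen for the whole training run.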

Why do we need it? Well, it seems the experts who have tested it found that they needed it! Is it complex? It is, but the whole RLHF/PPO fine-tuned LLM system is a complex system with a lot of tuning happening!
