Trust region in the PPO equation and KL divergence

Hi,

I was wondering how the trust region in the PPO equation and the KL divergence work together. They both seem to constrain the RLHF-tuned version of the LLM so that it does not deviate too much from the original LLM.

Thanks so much!

1 Like

From what I gathered, the PPO equation applies to the reinforcement learning loop surrounding the LLM, while the KL divergence is built into PEFT methods such as LoRA. If I’m right, the KL divergence metric would be used during fine-tuning to keep the model from going “out of bounds”, while the PPO reward function would prevent reward hacking. Thoughts?

1 Like

The reward model scores LLM completions against human preferences. PPO is a reinforcement learning algorithm that uses those scores to compute the update to the LLM’s weights.

If PPO makes an excessive update, it can lead to problems like catastrophic forgetting. The KL divergence penalty is used to make sure that the updated policy doesn’t deviate too much from the original (reference) policy.
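To make the two mechanisms concrete, here is a minimal PyTorch sketch of how they typically show up in an RLHF loop. The KL term compares the policy being trained against the frozen original LLM and is subtracted from the reward, while PPO’s clipped ratio is the trust region that limits how far a single update can move the policy from the one that generated the rollouts. The function names, `kl_coef`, and `clip_eps` are illustrative assumptions, not taken from any particular library.

```python
import torch

def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, kl_coef=0.1):
    # Per-token KL estimate between the current policy and the frozen
    # reference (original) LLM; subtracting it from the reward penalises
    # the policy for drifting away from the original model.
    kl = policy_logprobs - ref_logprobs
    return reward - kl_coef * kl

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio between the policy being updated and the "old"
    # policy that generated the rollouts in the previous iteration.
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Clipping the ratio to [1 - eps, 1 + eps] is PPO's trust region:
    # it caps how much any single update can change the policy.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.mean(torch.min(ratio * advantages, clipped * advantages))
```

So the two constraints act at different “distances”: the clip term is relative to the previous rollout policy, one update at a time, whereas the KL penalty is always measured against the original, frozen LLM over the whole training run.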

1 Like