Trust region in the PPO equation and KL divergence

Hi,

I was wondering how the trust region in the PPO equation and the KL divergence work together. They both seem to constrain the RLHF-tuned version of the LLM so that it does not deviate too much from the original LLM.

Thanks so much!

1 Like

From what I gathered, the PPO equation applies to the reinforcement learning loop surrounding the LLM, while the KL divergence is built into PEFT methods such as LoRA. If I’m right, the KL divergence metric would be used during fine-tuning to keep the model from going “out of bounds”, while the PPO reward function would prevent reward hacking. Thoughts?

1 Like

The reward model scores LLM completions against human preferences. PPO is a reinforcement learning algorithm that uses those scores to compute the update to the LLM’s weights.

If PPO makes an excessive update, it can lead to problems like catastrophic forgetting. The KL divergence penalty is used to make sure that the updated policy doesn’t deviate too much from the original (reference) policy.
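To make the two mechanisms concrete, here is a minimal PyTorch sketch of how they typically show up in an RLHF loop. The KL term compares the policy being trained against the frozen original LLM and is subtracted from the reward, while PPO’s clipped ratio is the trust region that limits how far a single update can move the policy from the one that generated the rollouts. The function names, `kl_coef`, and `clip_eps` are illustrative assumptions, not taken from any particular library.

```python
import torch

def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, kl_coef=0.1):
    # Per-token KL estimate between the current policy and the frozen
    # reference (original) LLM; subtracting it from the reward penalises
    # the policy for drifting away from the original model.
    kl = policy_logprobs - ref_logprobs
    return reward - kl_coef * kl

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio between the policy being updated and the "old"
    # policy that generated the rollouts in the previous iteration.
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Clipping the ratio to [1 - eps, 1 + eps] is PPO's trust region:
    # it caps how much any single update can change the policy.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.mean(torch.min(ratio * advantages, clipped * advantages))
```

So the two constraints act at different “distances”: the clip term is relative to the previous rollout policy, one update at a time, whereas the KL penalty is always measured against the original, frozen LLM over the whole training run.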

1 Like