PPO Fine-tune Metrics

Diana_Liu · July 19, 2023, 8:01pm

When optimizing a policy with PPO using (query, response, reward) triplets, one of the logged metrics is ‘ppo/policy/advantages_mean’, and as I understand it PPO tries to maximize the advantages. How should I interpret this metric? Should it become bigger with more training epochs? I ran 3 training steps and it looks like it’s getting smaller.

Juan_Olano · July 19, 2023, 8:46pm

Hi @Diana_Liu ,

This topic is also a bit difficult for me to understand, but let me try an answer:

PPO tries to favor actions that lead to better results and to de-emphasize actions that lead to worse results. This process should, in principle, maximize the expected return and bring us closer to our goal.
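To make "favor" and "de-emphasize" a bit more concrete, here is a minimal sketch of the clipped surrogate objective from the PPO paper, written for a single action. The function name `clipped_surrogate` and the default `clip_eps=0.2` are my own choices for illustration, not something taken from the course lab or from a specific library.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-action PPO clipped objective (a quantity to be maximized).

    logp_new / logp_old: log-probability of the action under the
    updated policy and under the policy that generated the sample.
    advantage: estimated advantage of that action.
    clip_eps: clipping range (0.2 is an assumed, commonly used value).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to move the ratio far
    # outside [1 - clip_eps, 1 + clip_eps] in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: raising the action's probability helps, up to the clip.
gain_good = clipped_surrogate(math.log(2.0), 0.0, advantage=1.0)
# Negative advantage: raising the action's probability is fully penalized.
gain_bad = clipped_surrogate(math.log(2.0), 0.0, advantage=-1.0)
```

With a positive advantage the objective rewards making the action more likely, and with a negative advantage it rewards making it less likely, which is the favoring and de-emphasizing described above.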

If PPO works, then as training steps pass we would expect the model to approach the goal. As it improves more and more, the metric should tend to become smaller, because on each step the difference between the best action and the average action should shrink.

Conversely, if the metric is increasing, the distance between the ideal action and the average action is growing, which means that we are not getting closer to the goal.
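Here is a small sketch of how a mean advantage could be computed, assuming Generalized Advantage Estimation (GAE) over a single response with per-token rewards and value estimates. The helper names `gae_advantages` and `mean`, and the toy numbers, are hypothetical. The exact computation behind ‘ppo/policy/advantages_mean’ depends on the library, and some implementations normalize the advantages before logging, which would keep the mean near zero regardless of training progress.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """GAE advantages for one trajectory.

    rewards[t]: reward received at step t.
    values[t]:  critic's value estimate at step t (the value after the
                final step is treated as 0).
    """
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD error: how much better step t turned out than the critic expected.
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages

def mean(xs):
    return sum(xs) / len(xs)

rewards = [0.0, 0.0, 1.0]  # toy case: reward only at the end of the response

# Critic underestimates the return: advantages are positive.
early = mean(gae_advantages(rewards, [0.5, 0.5, 0.5], lam=1.0))
# Critic predicts the return exactly: advantages vanish.
later = mean(gae_advantages(rewards, [1.0, 1.0, 1.0], lam=1.0))
```

In this toy case the mean advantage drops from `early` to `later` simply because the value estimates caught up with the rewards actually obtained, so a shrinking mean is not necessarily a bad sign.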

Thoughts?
