PPO Fine-tune Metrics

When optimizing a policy with PPO using (query, response, reward) triplets, one of the logged metrics is ‘ppo/policy/advantages_mean’, and PPO maximizes the advantages. How should I understand this metric? Should it become larger with more training epochs? I ran 3 training steps and it looks like it’s getting smaller.

Hi @Diana_Liu,

This topic is also a bit difficult for me to understand, but let me try to answer:

PPO tries to favor actions that lead to better results and de-emphasize actions that lead to worse results. This process will, in principle, maximize the expected return and bring us closer to our goal.
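To make "favor" and "de-emphasize" concrete, here is a minimal sketch of the per-sample clipped surrogate objective that PPO maximizes. The function name `clipped_surrogate` and the default `clip_eps=0.2` are my own choices for illustration, not taken from any particular library.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped objective (to be maximized).

    ratio: new_policy_prob / old_policy_prob for the action taken.
    advantage: how much better the action was than the baseline.
    clip_eps: illustrative clipping range (an assumed default).
    """
    # Keep the probability ratio inside [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: raising the action's probability helps,
# but the gain is capped once the ratio exceeds 1 + eps.
good_action = clipped_surrogate(ratio=1.5, advantage=1.0)

# Negative advantage: lowering the probability helps,
# but the gain is capped once the ratio drops below 1 - eps.
bad_action = clipped_surrogate(ratio=0.5, advantage=-1.0)
```

The sign of the advantage decides whether an action's probability is pushed up or down, and the clipping limits how far a single update can move it.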

If PPO is working, then as training progresses we would expect the model to approach the goal. The advantage measures how much better an action's return is than the value function's estimate for that state. So as the policy improves, the metric should tend to become smaller, because at each step the difference between the best action and the average action shrinks.

Conversely, if the metric is increasing, the distance between the ideal action and the average action is growing, which means that we are not getting closer to the goal.
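As a rough illustration of why the mean advantage can shrink, here is a small sketch that computes advantages with Generalized Advantage Estimation (GAE) for one toy trajectory. The function names, the reward and value numbers, and the `gamma`/`lam` settings are all made up for the example; this is not the code of any specific PPO library.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory.

    rewards[t] and values[t] are the reward and the critic's value
    estimate at step t; the value after the last step is taken as 0.
    """
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD error: what happened versus what the critic expected.
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages


def mean(xs):
    return sum(xs) / len(xs)


rewards = [0.0, 0.0, 1.0]  # toy episode: reward only at the end

# Early in training (hypothetical): the critic predicts nothing.
early_adv = gae_advantages(rewards, values=[0.0, 0.0, 0.0])

# Later (hypothetical): the critic predicts the return correctly.
late_adv = gae_advantages(rewards, values=[1.0, 1.0, 1.0])

print(mean(early_adv), mean(late_adv))
```

In this toy case the mean advantage is large while the value estimates are off and drops to zero once they match the observed returns. Note also that some PPO implementations normalize advantages within each batch, in which case the logged mean stays near zero regardless, so it is worth checking how your trainer computes the metric before reading much into three steps.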

