In many Actor–Critic reinforcement learning algorithms (including PPO), we typically talk about two main components:
The Actor (the policy network), which decides which action to take in a given state.
The Critic (the value function approximator), which evaluates how good it is to be in a certain state (or how good it is to take a certain action in that state).
This "critic" is essentially a learned function that estimates future returns (e.g., V(s) or Q(s,a)). It criticizes the actor's actions by providing a scalar signal (the estimated value of being in a state or taking an action), which then helps update the policy more stably than a pure policy gradient alone.
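As a rough sketch of these two components (PyTorch here; the layer sizes and discrete-action setup are assumptions, not anything prescribed by PPO itself):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps a state to a distribution over discrete actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

class Critic(nn.Module):
    """Value network: maps a state to a scalar estimate of V(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```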
What Is the Critic?
Value Function Approximator
The critic is usually a neural network trained to approximate one of the following:
Vπ(s): the expected return (sum of discounted rewards) when starting from state s and following policy π.
Qπ(s, a): the expected return when starting from state s, taking action a, and then following policy π.
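For intuition, here is a minimal sketch of how a critic estimating Vπ(s) can be trained: regress its output onto observed discounted returns from sampled episodes (the helper and variable names are hypothetical):

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * r_{t+1} + ... for one episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.tensor(returns)

def critic_update(critic, optimizer, states, rewards, gamma=0.99):
    """Fit the critic to Monte Carlo returns.

    states: tensor of shape [T, state_dim]; rewards: list of T floats.
    """
    targets = discounted_returns(rewards, gamma)
    values = critic(states)                      # V(s_t) for each step
    loss = torch.nn.functional.mse_loss(values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```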
Why We Need It:
Policy gradient methods can have high variance in their gradient estimates. By using a critic (value function), we effectively have a more accurate measure of "how good or bad the outcome of an action is," which reduces variance and often speeds up training.
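A toy numerical illustration of this point (made-up numbers, and ignoring the ∇log π factor for simplicity): the policy gradient weights each log-probability by a return-like signal, and subtracting a state-dependent baseline shrinks the spread of that signal dramatically when different states have very different expected returns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: visits to three states whose expected returns differ a lot,
# plus some per-visit noise.
state_means = np.array([1.0, 10.0, 100.0])
states = rng.integers(0, 3, size=100_000)
returns = state_means[states] + rng.normal(scale=2.0, size=states.size)

# A critic-like baseline: the (approximate) expected return of each visited state.
baseline = state_means[states]

print("variance of raw returns:        ", returns.var())
print("variance of return - baseline:  ", (returns - baseline).var())
```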
How Does the Critic Interact with the Actor?
Advantage Estimation
In algorithms like PPO, we typically use an advantage function A_t:
A_t = Qπ(s_t, a_t) − Vπ(s_t), or an equivalent approximation.
Because we have the critic's estimate of Vπ(s) (or Qπ(s, a)), we can compute:
A_t ≈ δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + ⋯,
where δ_t = r_t + γ Vπ(s_{t+1}) − Vπ(s_t) is the "temporal difference" error.
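A compact sketch of that computation (generalized advantage estimation, GAE), assuming we have the per-step rewards and the critic's values for one trajectory, including a bootstrap value for the final next state; episode-boundary masking is omitted for brevity:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute A_t ≈ δ_t + (γλ) δ_{t+1} + ... for one trajectory.

    rewards: length-T array of r_t
    values:  length-(T+1) array of critic estimates V(s_t), including V(s_T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error δ_t
        gae = delta + gamma * lam * gae                          # accumulates powers of (γλ)
        advantages[t] = gae
    return advantages
```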
This advantage A_t is then used to update the policy parameters (the actor). Essentially, the critic's value estimates help the actor understand how much better or worse a chosen action was compared to what the critic expected. This feedback provides a more stable and lower-variance learning signal.
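For concreteness, this is roughly how the advantage enters the actor's update in PPO's clipped surrogate objective (a sketch; the tensor names are assumptions):

```python
import torch

def ppo_actor_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss; all inputs are 1-D tensors over a batch."""
    ratio = torch.exp(new_log_probs - old_log_probs)              # π_new(a|s) / π_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # minimize the negative => ascend the surrogate
```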
Why Would an Algorithm Not Use a Critic?
Pure Policy Gradient (No Critic):
You can compute policy gradient updates by sampling trajectories and using the returns (sum of discounted rewards) directly to update the policy. This approach (e.g., REINFORCE) can work, but it tends to have higher variance because it doesn't subtract a baseline that captures the "average" outcome.
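For comparison, a minimal REINFORCE-style update with no critic at all, reusing the hypothetical Actor and discounted_returns sketches above; the only learning signal is the raw discounted return:

```python
import torch

def reinforce_update(actor, optimizer, states, actions, rewards, gamma=0.99):
    """Pure policy gradient: weight log-probs by the full discounted returns."""
    returns = discounted_returns(rewards, gamma)   # high-variance signal, no baseline subtracted
    dist = actor(states)                           # Categorical distribution over actions
    log_probs = dist.log_prob(actions)
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```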
Baselines or Other Methods:
Some methods incorporate a baseline (a simpler form of a critic) or other variance-reduction techniques. In your case, GRPO (Group Relative Policy Optimization) replaces the learned critic with a group-based baseline: it samples a group of outputs for the same prompt and scores each one relative to the group's average reward, so it does not need a standard critic network.
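Roughly, the group-relative advantage at the heart of GRPO can be sketched like this (only the advantage computation, not the full algorithm; the function name is an assumption):

```python
import torch

def group_relative_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward within its group.

    group_rewards: tensor of shape [G] with one scalar reward per sampled
    output for the same prompt.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```

The group statistics play the role the critic's value estimate plays in PPO: each sample is judged relative to what the policy's other samples achieved, rather than relative to a learned V(s).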