In many Actor–Critic reinforcement learning algorithms (including PPO), we typically talk about two main components:
The Actor (the policy network), which decides which action to take in a given state.
The Critic (the value function approximator), which evaluates how good it is to be in a certain state (or how good it is to take a certain action in that state).
This "critic" is essentially a learned function that estimates future returns (e.g., V(s) or Q(s,a)). It criticizes the actor's actions by providing a scalar signal (the estimated value of being in a state or taking an action), which then helps update the policy more stably than a pure policy gradient alone.
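As a rough sketch of these two components (PyTorch here; the layer sizes and discrete-action setup are assumptions, not anything prescribed by PPO itself):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps a state to a distribution over discrete actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

class Critic(nn.Module):
    """Value network: maps a state to a scalar estimate of V(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```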
What Is the Critic?
Value Function Approximator
The critic is usually a neural network trained to approximate one of the following:
Vπ(s): the expected return (sum of discounted rewards) when starting from state s and following policy π.
Qπ(s, a): the expected return when starting from state s, taking action a, and then following policy π.
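For intuition, here is a minimal sketch of how a critic estimating Vπ(s) can be trained: regress its output onto observed discounted returns from sampled episodes (the helper and variable names are hypothetical):

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * r_{t+1} + ... for one episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.tensor(returns)

def critic_update(critic, optimizer, states, rewards, gamma=0.99):
    """Fit the critic to Monte Carlo returns.

    states: tensor of shape [T, state_dim]; rewards: list of T floats.
    """
    targets = discounted_returns(rewards, gamma)
    values = critic(states)                      # V(s_t) for each step
    loss = torch.nn.functional.mse_loss(values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```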
Why We Need It:
Policy gradient methods can have high variance in their gradient estimates. By using a critic (value function), we effectively have a more accurate measure of "how good or bad the outcome of an action is," which reduces variance and often speeds up training.
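A toy numerical illustration of this point (made-up numbers, and ignoring the ∇log π factor for simplicity): the policy gradient weights each log-probability by a return-like signal, and subtracting a state-dependent baseline shrinks the spread of that signal dramatically when different states have very different expected returns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: visits to three states whose expected returns differ a lot,
# plus some per-visit noise.
state_means = np.array([1.0, 10.0, 100.0])
states = rng.integers(0, 3, size=100_000)
returns = state_means[states] + rng.normal(scale=2.0, size=states.size)

# A critic-like baseline: the (approximate) expected return of each visited state.
baseline = state_means[states]

print("variance of raw returns:        ", returns.var())
print("variance of return - baseline:  ", (returns - baseline).var())
```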
How Does the Critic Interact with the Actor?
Advantage Estimation
In algorithms like PPO, we typically use an advantage function A_t:
A_t = Qπ(s_t, a_t) − Vπ(s_t), or an equivalent approximation.
Because we have the critic's estimate of Vπ(s) (or Qπ(s, a)), we can compute:
A_t ≈ δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + ⋯,
where δ_t = r_t + γ Vπ(s_{t+1}) − Vπ(s_t) is the "temporal difference" error.
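A compact sketch of that computation (generalized advantage estimation, GAE), assuming we have the per-step rewards and the critic's values for one trajectory, including a bootstrap value for the final next state; episode-boundary masking is omitted for brevity:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute A_t ≈ δ_t + (γλ) δ_{t+1} + ... for one trajectory.

    rewards: length-T array of r_t
    values:  length-(T+1) array of critic estimates V(s_t), including V(s_T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error δ_t
        gae = delta + gamma * lam * gae                          # accumulates powers of (γλ)
        advantages[t] = gae
    return advantages
```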
This advantage A_t is then used to update the policy parameters (the actor). Essentially, the critic's value estimates help the actor understand how much better or worse a chosen action was compared to what the critic expected. This feedback provides a more stable and lower-variance learning signal.
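For concreteness, this is roughly how the advantage enters the actor's update in PPO's clipped surrogate objective (a sketch; the tensor names are assumptions):

```python
import torch

def ppo_actor_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss; all inputs are 1-D tensors over a batch."""
    ratio = torch.exp(new_log_probs - old_log_probs)              # π_new(a|s) / π_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # minimize the negative => ascend the surrogate
```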
Why Would an Algorithm Not Use a Critic?
Pure Policy Gradient (No Critic):
You can compute policy gradient updates by sampling trajectories and using the returns (sum of discounted rewards) directly to update the policy. This approach (e.g., REINFORCE) can work, but it tends to have higher variance because it doesn't subtract a baseline that captures the "average" outcome.
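For comparison, a minimal REINFORCE-style update with no critic at all, reusing the hypothetical Actor and discounted_returns sketches above; the only learning signal is the raw discounted return:

```python
import torch

def reinforce_update(actor, optimizer, states, actions, rewards, gamma=0.99):
    """Pure policy gradient: weight log-probs by the full discounted returns."""
    returns = discounted_returns(rewards, gamma)   # high-variance signal, no baseline subtracted
    dist = actor(states)                           # Categorical distribution over actions
    log_probs = dist.log_prob(actions)
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```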
Baselines or Other Methods:
Some methods incorporate a baseline (a simpler form of a critic) or other variance-reduction techniques. In your case, GRPO (Group Relative Policy Optimization) replaces the learned critic with a group-based baseline: it samples a group of outputs for the same prompt and scores each one relative to the group's average reward, so it does not need a standard critic network.
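Roughly, the group-relative advantage at the heart of GRPO can be sketched like this (only the advantage computation, not the full algorithm; the function name is an assumption):

```python
import torch

def group_relative_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward within its group.

    group_rewards: tensor of shape [G] with one scalar reward per sampled
    output for the same prompt.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```

The group statistics play the role the critic's value estimate plays in PPO: each sample is judged relative to what the policy's other samples achieved, rather than relative to a learned V(s).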