Now, I understand that we train the Q neural network such that when we have an input (s_1,a_1), the output is approximately the same as R(s_1)+q, where q is the maximum of the outputs of the same neural network with inputs (s_1’,a_1’) (a_1’ is variable). But how do we know that this summation of R(s_1) and q gives the correct value of Q(s_1,a_1), and is not offset by some constant? I understand recursion and why, in this case, it is not ‘offset’. What I mean to ask is, where have we specified the base conditions of the neural network, like in recursion?
Hi @bandr,
There’s a well-known proof of convergence for Q-learning in the tabular setting. Recall the Bellman equation we use:

Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')
We are not just solving a recursive equation with arbitrary values. The reward R(s) provides a ground truth, just like a base case in recursion. When we are at terminal states we know that Q(s_{\rm terminal}, a)=R(s_{\rm terminal}), because there are no future rewards. Even though Q(s, a) depends on Q(s', a'), and so on recursively, the rewards act as the base cases that “pull” the entire value function into place.
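To make the “base case” concrete, here is a minimal tabular Q-learning sketch. Everything in it (the state/action counts, gamma, alpha, and the helper name `q_update`) is made up for illustration, not something specified in the course; the point is just that the `terminal` branch is where the recursion bottoms out in the reward alone.

```python
import numpy as np

# Illustrative sizes and hyperparameters (assumed, not from the course).
n_states, n_actions = 6, 2
gamma, alpha = 0.9, 0.1
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, terminal):
    """One Q-learning backup for the transition (s, a, r, s_next)."""
    if terminal:
        target = r                               # base case: no future rewards
    else:
        target = r + gamma * np.max(Q[s_next])   # recursive case: bootstrap
    Q[s, a] += alpha * (target - Q[s, a])        # move Q(s, a) toward the target
```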
Why doesn’t a constant offset persist? Suppose you offset all Q-values by a constant C:

Q'(s, a) = Q(s, a) + C

Then the Bellman equation becomes:

Q'(s, a) = R(s) + \gamma \max_{a'} Q'(s', a') = R(s) + \gamma \max_{a'} Q(s', a') + \gamma C = Q(s, a) + \gamma C

For Q' to be the shifted solution we would need Q'(s, a) = Q(s, a) + C, i.e. \gamma C = C. Since \gamma < 1, this forces C = 0, so no nonzero uniform offset can satisfy the Bellman equation.
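You can also see this numerically. The 3-state MDP below is a toy I made up purely for this demo (it is not from the course): run synchronous Bellman backups starting from all zeros and from all zeros plus a large constant, and the two estimates converge to the same values, because the offset is shrunk by \gamma on every backup and wiped out entirely at terminal states.

```python
import numpy as np

# Toy deterministic MDP: 3 states, 2 actions, state 2 is terminal.
R = np.array([0.0, 1.0, 10.0])        # reward received in each state
P = np.array([[1, 0],                 # next state for (state, action)
              [2, 0],
              [2, 2]])                # terminal state loops to itself
terminal = np.array([False, False, True])
gamma = 0.9

def bellman_backup(Q):
    """One synchronous Bellman backup over all (s, a) pairs."""
    Q_new = np.empty_like(Q)
    for s in range(3):
        for a in range(2):
            future = 0.0 if terminal[s] else gamma * Q[P[s, a]].max()
            Q_new[s, a] = R[s] + future
    return Q_new

C = 100.0
Q_a = np.zeros((3, 2))            # standard initialization
Q_b = np.zeros((3, 2)) + C        # offset initialization
for _ in range(200):
    Q_a, Q_b = bellman_backup(Q_a), bellman_backup(Q_b)

print(np.max(np.abs(Q_a - Q_b)))  # ~1e-7 or smaller: the offset has washed out
```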
With function approximation, i.e. a neural network Q_{\theta}(s, a) as in DQN, these strong convergence guarantees no longer hold. In practice we rely on the network being expressive enough to approximate the true Q-function, and even without a formal guarantee, DQN often works well.
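For completeness, here is a hedged sketch of how the same target shows up in a DQN-style setup. The names `build_targets`, `q_target_net`, and `dummy_net` are placeholders I am inventing for illustration (not a specific library API); the `dones` flag plays exactly the role of the tabular base case.

```python
import numpy as np

def build_targets(q_target_net, rewards, next_states, dones, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), with the future term dropped
    at terminal transitions (dones == 1), mirroring the tabular base case."""
    next_q = q_target_net(next_states)        # shape (batch, n_actions)
    max_next_q = next_q.max(axis=1)           # greedy value of the next state
    return rewards + gamma * (1.0 - dones) * max_next_q

# Dummy stand-in for a (frozen) target network, just to make the sketch runnable.
dummy_net = lambda states: np.zeros((states.shape[0], 4))
y = build_targets(dummy_net,
                  rewards=np.array([1.0, 0.0]),
                  next_states=np.zeros((2, 8)),
                  dones=np.array([0.0, 1.0]))
# The Q-network is then regressed toward y (e.g. mean-squared error on the
# taken actions), treating y as a fixed label with no gradient through it.
```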