Now, I understand that we train the Q neural network such that when we have an input (s_1,a_1), the output is approximately the same as R(s_1)+q, where q is the maximum of the outputs of the same neural network with inputs (s_1’,a_1’) (a_1’ is variable). But how do we know that this summation of R(s_1) and q gives the correct value of Q(s_1,a_1), and is not offset by some constant? I understand recursion and why, in this case, it is not ‘offset’. What I mean to ask is, where have we specified the base conditions of the neural network, like in recursion?
Hi @bandr,
There’s a well-known proof of convergence for Q-learning in the tabular setting. Recall the Bellman equation we use:

Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')
We are not just solving a recursive equation with arbitrary values. The reward R(s) provides a ground truth, just like a base case in recursion. When we are at terminal states we know that Q(s_{\rm terminal}, a)=R(s_{\rm terminal}), because there are no future rewards. Even though Q(s, a) depends on Q(s', a'), and so on recursively, the rewards act as the base cases that “pull” the entire value function into place.
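To make the “base case” concrete, here is a minimal tabular Q-learning sketch. Everything in it (the state/action counts, gamma, alpha, and the helper name `q_update`) is made up for illustration, not something specified in the course; the point is just that the `terminal` branch is where the recursion bottoms out in the reward alone.

```python
import numpy as np

# Illustrative sizes and hyperparameters (assumed, not from the course).
n_states, n_actions = 6, 2
gamma, alpha = 0.9, 0.1
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, terminal):
    """One Q-learning backup for the transition (s, a, r, s_next)."""
    if terminal:
        target = r                               # base case: no future rewards
    else:
        target = r + gamma * np.max(Q[s_next])   # recursive case: bootstrap
    Q[s, a] += alpha * (target - Q[s, a])        # move Q(s, a) toward the target
```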
Why doesn’t a constant offset persist? Suppose you offset all Q-values by a constant C:

Q'(s, a) = Q(s, a) + C

Then the Bellman equation becomes:

Q'(s, a) = R(s) + \gamma \max_{a'} Q'(s', a') = R(s) + \gamma \max_{a'} Q(s', a') + \gamma C = Q(s, a) + \gamma C

For Q' to be the shifted solution we would need Q'(s, a) = Q(s, a) + C, i.e. \gamma C = C. Since \gamma < 1, this forces C = 0, so no nonzero uniform offset can satisfy the Bellman equation.
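You can also see this numerically. The 3-state MDP below is a toy I made up purely for this demo (it is not from the course): run synchronous Bellman backups starting from all zeros and from all zeros plus a large constant, and the two estimates converge to the same values, because the offset is shrunk by \gamma on every backup and wiped out entirely at terminal states.

```python
import numpy as np

# Toy deterministic MDP: 3 states, 2 actions, state 2 is terminal.
R = np.array([0.0, 1.0, 10.0])        # reward received in each state
P = np.array([[1, 0],                 # next state for (state, action)
              [2, 0],
              [2, 2]])                # terminal state loops to itself
terminal = np.array([False, False, True])
gamma = 0.9

def bellman_backup(Q):
    """One synchronous Bellman backup over all (s, a) pairs."""
    Q_new = np.empty_like(Q)
    for s in range(3):
        for a in range(2):
            future = 0.0 if terminal[s] else gamma * Q[P[s, a]].max()
            Q_new[s, a] = R[s] + future
    return Q_new

C = 100.0
Q_a = np.zeros((3, 2))            # standard initialization
Q_b = np.zeros((3, 2)) + C        # offset initialization
for _ in range(200):
    Q_a, Q_b = bellman_backup(Q_a), bellman_backup(Q_b)

print(np.max(np.abs(Q_a - Q_b)))  # ~1e-7 or smaller: the offset has washed out
```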
With function approximation, i.e. a neural network Q_{\theta}(s, a) as in DQN, these strong convergence guarantees no longer hold. In practice we rely on the network being expressive enough to approximate the true Q-function, and even without a formal guarantee, DQN often works well.
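For completeness, here is a hedged sketch of how the same target shows up in a DQN-style setup. The names `build_targets`, `q_target_net`, and `dummy_net` are placeholders I am inventing for illustration (not a specific library API); the `dones` flag plays exactly the role of the tabular base case.

```python
import numpy as np

def build_targets(q_target_net, rewards, next_states, dones, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), with the future term dropped
    at terminal transitions (dones == 1), mirroring the tabular base case."""
    next_q = q_target_net(next_states)        # shape (batch, n_actions)
    max_next_q = next_q.max(axis=1)           # greedy value of the next state
    return rewards + gamma * (1.0 - dones) * max_next_q

# Dummy stand-in for a (frozen) target network, just to make the sketch runnable.
dummy_net = lambda states: np.zeros((states.shape[0], 4))
y = build_targets(dummy_net,
                  rewards=np.array([1.0, 0.0]),
                  next_states=np.zeros((2, 8)),
                  dones=np.array([0.0, 1.0]))
# The Q-network is then regressed toward y (e.g. mean-squared error on the
# taken actions), treating y as a fixed label with no gradient through it.
```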