Convergence of the DQN algorithm

I’m having some trouble understanding why the DQN algorithm converges towards the true state-action value function. I’ve read some blog posts, especially Reinforcement Learning Explained Visually (Part 4): Q Learning, step-by-step | by Ketan Doshi | Towards Data Science (it discusses Q-tables rather than DQN, but it feels as if the answer should be quite similar).


I get the sense that it’s at the terminal states that the DQN algorithm starts to get more accurate approximations. With the update target Q(s, a) = R(s) + gamma * max_a' Q(s', a'), if s happens to be a terminal state, do we then need to define Q(s, a) = R(s) so that the terminal Q value “gets updated with solely real reward data and no estimated values”?

So basically my question is: is it at the terminal states that the DQN algorithm starts getting better at approximating the Q values (i.e., do we have to have terminal states for the algorithm to work?), and if so, do we need to set Q(s, a) = R(s) whenever s is a terminal state?

First, by definition, when s is a terminal state you have no choice but Q(s, a) = R(s): there are no further actions to take and no future rewards, so the bootstrapped term gamma * max_a' Q(s', a') simply drops out of the target.
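In practice this is usually handled with a "done" flag stored alongside each transition: for terminal transitions the bootstrap term is masked out, so the target reduces to the observed reward alone. A minimal sketch, assuming a PyTorch setup where `target_net` and the batch tensors are illustrative names rather than anything from a specific library:

```python
import torch

def td_targets(rewards, next_states, dones, target_net, gamma=0.99):
    """Compute DQN targets; `dones` is 1.0 for terminal transitions, else 0.0."""
    with torch.no_grad():
        # max_a' Q_target(s', a') for each transition in the batch
        next_q = target_net(next_states).max(dim=1).values
    # For terminal transitions the (1 - dones) mask zeroes the bootstrap term,
    # so the target is exactly the observed reward R(s) with no estimated values.
    return rewards + gamma * (1.0 - dones) * next_q
```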

Second, the DQN improves over training steps, and the improvement does not happen only at the terminal states. Even if the improvement happened only at terminal states, that would not help us, because we need a DQN that works at all the other states as well; remember that the network’s output depends on the state it is given.
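You can see that the improvement is not confined to terminal states by looking at a single training update: the loss is computed over every transition in the sampled minibatch, most of which are non-terminal, and the gradient step adjusts the network for all of them at once. A hedged sketch of one such update, reusing the `td_targets` helper above (the `q_net` and `optimizer` names are assumptions for illustration):

```python
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch  # actions: long tensor [B]
    # Predicted Q(s, a) for the actions actually taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    targets = td_targets(rewards, next_states, dones, target_net, gamma)
    # The loss covers every state in the batch, terminal or not, so the
    # gradient step improves the estimates across the visited state space.
    loss = F.smooth_l1_loss(q_pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```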

So each time we run a training update, the DQN should improve a little, and that improvement accumulates over many updates.
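Your intuition that accuracy “starts” at the terminal transitions is easiest to see in the tabular case from the blog post you linked: the terminal reward is the only exact target at first, and repeated bootstrapped updates propagate that information backward to earlier states. A small self-contained illustration (the 5-state chain and the reward of 1 at the end are made up for the example):

```python
import numpy as np

# Chain MDP: states 0..4, single action "move right"; state 4 is terminal, reward 1.
n_states, gamma = 5, 0.9
Q = np.zeros(n_states)  # one action, so Q is just one value per state

for sweep in range(50):
    for s in range(n_states - 1):
        s_next = s + 1
        reward = 1.0 if s_next == n_states - 1 else 0.0
        terminal = (s_next == n_states - 1)
        # Bootstrapped target; for transitions into the terminal state it is the reward alone
        target = reward + (0.0 if terminal else gamma * Q[s_next])
        Q[s] += 0.5 * (target - Q[s])  # learning rate 0.5

print(np.round(Q, 3))  # approaches [gamma**3, gamma**2, gamma, 1.0, 0.0]
```

The value next to the terminal state becomes accurate first, and each sweep pushes that accuracy one step further back, which is the same accumulation of improvement that happens (with function approximation instead of a table) in DQN.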