# Convergence of the DQN algorithm

I’m having some trouble understanding why the DQN algorithm converges towards the true state-action value function. I have read some blog posts, especially this one (it discusses Q-tables rather than DQN, but it feels as if the answer should be quite similar): Reinforcement Learning Explained Visually (Part 4): Q Learning, step-by-step | by Ketan Doshi | Towards Data Science.

I get the sense that it’s in the terminal states that the DQN algorithm starts producing more accurate approximations. With Q(s, a) = R(s) + gamma * max_a' Q(s', a'), if s happens to be a terminal state, do we then need to define Q(s, a) = R(s) so that the terminal Q-value “gets updated with solely real reward data and no estimated values”?

So basically my question is: is it in the terminal states that the DQN algorithm starts getting better at approximating the Q-values (i.e., do we need terminal states for the algorithm to work at all?), and if so, do we need to set Q(s, a) = R(s) whenever s is a terminal state?

First, by definition, when s is a terminal state you have no choice but Q(s, a) = R(s): there are no further actions to take, so there is no future return to bootstrap from.
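A minimal sketch of this rule, assuming the usual one-step DQN target and a hypothetical `td_target` helper (the function name and `done` flag are my own, not from the question):

```python
import numpy as np

def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step bootstrapped target for Q(s, a).

    next_q_values: estimated Q(s', a') for every action a' in s'.
    done: True if s' is terminal -- then there is no future return
    to bootstrap, so the target collapses to the observed reward.
    """
    if done:
        return reward                      # terminal: target is pure reward
    return reward + gamma * np.max(next_q_values)

# Terminal transition: only real reward data, no estimated values.
print(td_target(1.0, np.array([0.5, 2.0]), done=True))   # 1.0
# Non-terminal transition: reward plus discounted best next estimate.
print(td_target(1.0, np.array([0.5, 2.0]), done=False))  # 1.0 + 0.99 * 2.0 = 2.98
```

This is exactly the `if done` branch you see in most DQN implementations: it is not an extra trick, just the definition of the target at a terminal state.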

Second, the DQN improves over the course of training, and the improvement happens not only at the terminal states. Moreover, even if it were true that improvement happened only at the terminal states, that would not help us, because we need to train a DQN that works at all the other states as well. Remember that the DQN’s output is state-dependent.

So, each time we train the DQN it should improve, and that improvement accumulates over many training steps.