Hi, I understand why we have a Q_hat network: it keeps the y_target values stable for the sake of learning stability, and we update Q_hat's parameters slightly toward Q's parameters after each training iteration (every C time steps) with a soft update.
My question is: even though the soft update uses a tau << 1, do the two networks' parameters eventually converge? I don't think this was discussed in either the lectures or the labs.
Under certain conditions, the parameters of the target network \hat{Q} and the main network Q (parameterized by \theta_{\hat{Q}} and \theta_{Q} respectively) will eventually converge.
The soft update rule for the target network is: \theta_{\hat{Q}} \leftarrow \tau \theta_{Q} + (1 - \tau) \theta_{\hat{Q}},
where \tau \ll 1 is a small positive scalar (e.g. \tau = 0.001).
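Here is a minimal sketch of how such a soft (Polyak) update is typically implemented, assuming a PyTorch setup; the names q_net, q_target, and soft_update are illustrative, not taken from the course code.

```python
# Soft (Polyak) target update: theta_target <- tau*theta_source + (1 - tau)*theta_target
import copy
import torch
import torch.nn as nn

tau = 0.001                                    # soft-update coefficient, tau << 1

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # main Q network
q_target = copy.deepcopy(q_net)                # target network starts as an exact copy

@torch.no_grad()
def soft_update(target: nn.Module, source: nn.Module, tau: float) -> None:
    """Move each target parameter a fraction tau toward the corresponding source parameter."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * s_param)

# Illustration: if q_net stops changing, the parameter gap shrinks by (1 - tau) per update.
def gap() -> float:
    return sum((t - s).abs().sum().item()
               for t, s in zip(q_target.parameters(), q_net.parameters()))

with torch.no_grad():
    for p in q_net.parameters():               # perturb q_net once, then leave it frozen
        p.add_(torch.randn_like(p))

initial = gap()
for _ in range(5000):
    soft_update(q_target, q_net, tau)
print(gap() / initial)                         # roughly (1 - tau)**5000 ≈ 0.007
```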
As the update rule shows, every application moves \theta_{\hat{Q}} a fraction \tau of the remaining distance toward \theta_{Q}, so over time \theta_{\hat{Q}} approaches \theta_{Q}, provided \theta_{Q} stops changing or its changes slow down. That happens when the learning process stabilizes: the environment is not excessively stochastic, the agent learns effectively, the loss is minimized, and the Q-values approach the true expected return. In practice, however, the two may never fully converge: continued exploration (e.g. via an \epsilon-greedy policy) or high stochasticity in the environment can keep \theta_{Q} from ever settling.
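To make this quantitative: the soft update turns \theta_{\hat{Q}} into an exponential moving average of \theta_{Q}. In the idealized case where \theta_{Q} is held fixed, applying the update k times gives

\theta_{\hat{Q}}^{(k)} - \theta_{Q} = (1 - \tau)^{k} \left( \theta_{\hat{Q}}^{(0)} - \theta_{Q} \right),

so the gap decays geometrically; with \tau = 0.001 it drops to about e^{-1} of its initial size after roughly 1000 updates.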
In reinforcement learning the focus is usually on stabilizing training and ensuring convergence of the Q network itself; \hat{Q} merely provides a slowly evolving estimate of the expected return. This paper shows that, under the assumption that a good DNN approximation to the optimal Q-value function exists (plus some other technical assumptions), the DQN algorithm converges.