In supervised learning, we initialize the parameters and then improve them through training, which minimizes the gap between the predicted y and the true y. But in this DQN, part of y itself, namely max Q(s', a'), comes from the randomly initialized network. So how do we improve y when part of it is a random guess?
Hi @flyunicorn
In DQNs, the target value y = r + \gamma \max_{a'} Q(s', a') includes \max Q(s', a') that is estimated using the current network or a target network. Initially this estimate may be inaccurate, but it improves over time as the Q-network is trained. The key is that only part of y is a guess: the reward r is a real observation from the environment, and at terminal states the target is just r. Training then refines the Q-values iteratively through bootstrapping, where the network uses its own improving predictions to update itself, so the targets become more accurate as learning progresses.
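To make the bootstrapping idea concrete, here is a minimal sketch. It uses a small Q-table instead of a neural network, and the two-state chain environment, the `step` function, and the constants are all made up for illustration. The point is only that Q starts as random guesses and still settles on the right values, because the real reward keeps correcting the targets.

```python
import random

GAMMA = 0.9    # discount factor (assumed value)
ALPHA = 0.5    # learning rate (assumed value)
TERMINAL = 2   # hypothetical terminal state

def step(state, action):
    """Toy deterministic chain: states 0 and 1, terminal state 2.
    Action 0 moves left (bounded at 0), action 1 moves right."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if next_state == TERMINAL else 0.0
    return reward, next_state

def train(sweeps=300, seed=0):
    rng = random.Random(seed)
    # Random initial guesses for Q(s, a), as in the question.
    q = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(2)]
    for _ in range(sweeps):
        for s in range(2):
            for a in range(2):
                r, s2 = step(s, a)
                # Bootstrapped part of the target: our own current estimate.
                bootstrap = 0.0 if s2 == TERMINAL else max(q[s2])
                y = r + GAMMA * bootstrap
                # Move Q(s, a) part of the way toward the target y.
                q[s][a] += ALPHA * (y - q[s][a])
    return q

q = train()
print(q)
```

For this toy chain the fixed point can be worked out by hand: Q(1, right) = 1, Q(0, right) = 0.9, and both "left" actions are worth 0.81. The learned table should end up close to those values regardless of the random starting point.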
Hope it helps! Feel free to ask if you need further assistance.
Hi Alireza, you said “In DQNs, the target value y = r + \gamma \max_{a'} Q(s', a') includes \max Q(s', a') that is estimated using the current network or a target network”. What is this current network? Is it the neural network below? It sounds like there are two neural networks involved, and I'm confused about the setup.
Yes! The “current network” is indeed the neural network shown in the Deep Reinforcement Learning slide; it's the one used to approximate Q(s, a) at each step. In many DQN setups (at least in what I did in my recent project), there's also a separate target network, which is a copy of the current network whose weights are updated less frequently. This target network is used to compute the target value y = r + \gamma \max_{a'} Q_{\text{target}}(s', a'), which makes training more stable.
The target network helps us avoid chasing a moving target: because its weights stay fixed between updates, the target y does not shift every time the current network is updated, which makes learning more consistent.
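Here is a rough sketch of how the two networks fit together. `TinyQ` is a hypothetical stand-in for a real Q-network (just a table of values), and `SYNC_EVERY`, the learning rate, and the transitions are made-up numbers, so treat this as an illustration of the setup rather than a real DQN implementation.

```python
import copy

class TinyQ:
    """Hypothetical stand-in for a Q-network: a table of Q-values per state."""
    def __init__(self, n_states, n_actions):
        self.values = [[0.0] * n_actions for _ in range(n_states)]

    def predict(self, state):
        return self.values[state]

def td_target(reward, next_state, done, target_net, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), or just r at a terminal state."""
    if done:
        return reward
    return reward + gamma * max(target_net.predict(next_state))

online = TinyQ(n_states=3, n_actions=2)   # the "current" network, updated every step
target = copy.deepcopy(online)            # frozen copy, used only to compute y
SYNC_EVERY = 4                            # assumed sync interval

# Made-up transitions: (state, action, reward, next_state, done)
transitions = [(0, 1, 0.0, 1, False), (1, 1, 1.0, 2, True)] * 4

for t, (s, a, r, s2, done) in enumerate(transitions, start=1):
    y = td_target(r, s2, done, target)                   # target from the frozen copy
    online.values[s][a] += 0.5 * (y - online.values[s][a])  # update the current network
    if t % SYNC_EVERY == 0:
        target = copy.deepcopy(online)                   # periodic hard update

print(online.values)
```

Between syncs, `target` does not change, so the y for a given transition stays the same even though `online` is being updated on every step. In a real DQN the table update would be a gradient step on the network weights, but the split between the two copies is the same idea.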

