Hi @rmwkwok

Correct me if I got it wrong: the idea of IQN, in my opinion, comes from the screenshot example where we compare how the calculation changes as the process goes on and reaches step 3 and beyond. I think it is just the procedure of the algorithm introduced before soft update came up, i.e., the one described in the lecture Learning the state-value function. To illustrate my Q1, I made the learning rate small, which led to your comment:

> To me, your IQN is just another QN but of a smaller learning rate.

I agree.

Regarding this:

> To me, your IQN is just another QN but of a smaller learning rate.

If your QN here refers to the algorithm discussed in the linked lecture, that's right. To confirm we are talking about the same algorithm, let me list the procedure here, dropping the term IQN and using numbers to label the network at its different stages:

1. Initialize w and b for the QN; call it QN_0.
2. Explore: drive the agent with QN_0 under an ϵ-greedy policy to generate a list of experiences.
3. Build the training set, labeling the y values using QN_0 and the Bellman equation.
4. Train QN_0 on the generated training set; after the weights are updated, we get QN_1.
5. Use QN_1 and loop back to step 2; repeating this gives QN_2, QN_3, …
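To pin down what I mean, here is a minimal runnable sketch of that loop. The chain MDP, the table standing in for (w, b), and all the numbers (γ, α, ϵ, episode counts) are my own toy assumptions, not from the lecture:

```python
import numpy as np

# Hypothetical toy MDP: a 3-state chain; action 1 moves right, action 0
# moves left; reaching the last state gives reward 1 and ends the episode.
N_STATES, N_ACTIONS, GAMMA = 3, 2, 0.9
rng = np.random.default_rng(0)

def env_step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def one_iteration(qn, alpha=0.1, eps=0.1, episodes=20):
    """Given QN_k, run steps 2-4 and return QN_{k+1}."""
    frozen = qn.copy()                        # QN_k, used for acting and labels
    experience = []
    for _ in range(episodes):                 # step 2: eps-greedy exploration
        s, done = 0, False
        while not done:
            if rng.random() < eps:
                a = int(rng.integers(N_ACTIONS))
            else:
                a = int(np.argmax(frozen[s]))
            s2, r, done = env_step(s, a)
            experience.append((s, a, r, s2, done))
            s = s2
    new = qn.copy()
    for s, a, r, s2, done in experience:      # step 3: y from Bellman equation
        y = r if done else r + GAMMA * np.max(frozen[s2])
        new[s, a] += alpha * (y - new[s, a])  # step 4: move Q(s,a) toward y
    return new                                # this is QN_{k+1}

qn = np.zeros((N_STATES, N_ACTIONS))          # step 1: QN_0
for _ in range(50):                           # step 5: QN_1, QN_2, ...
    qn = one_iteration(qn)
```

Here `qn` is a plain table rather than a neural network, but the loop structure — freeze QN_k, act and label with it, train to get QN_{k+1} — is the same.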

To me, the key point of the whole procedure is not how we generate y, but how we can obtain a QN that makes the Bellman equation hold. Once the equation holds, the learning problem is solved and the model we get is exactly what we want. So it's easier for me to understand it as a search problem: we are searching the weight space (w, b) to make the equation **Q(s,a) = R(s) + γ·max_a' Q(s',a')** hold, and this is why I think the IQN makes sense, since every iteration adjusts the weights toward it.
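That search view can be written down directly: define the Bellman residual of a candidate Q and check whether it is zero. A small sketch on a made-up 2-non-terminal-state chain (all numbers are my own toy choices):

```python
import numpy as np

# Hypothetical chain: non-terminal states 0, 1; action 1 moves right,
# action 0 moves left; reaching state 2 yields reward 1 and terminates.
GAMMA = 0.9

def backup(q):
    """One application of the Bellman operator to a (2, 2) Q-table."""
    out = np.empty_like(q)
    for s in range(2):
        for a in range(2):
            s2 = s + 1 if a == 1 else max(s - 1, 0)
            if s2 == 2:                       # terminal: y is just the reward
                out[s, a] = 1.0
            else:                             # y = R(s) + gamma * max_a' Q(s', a')
                out[s, a] = 0.0 + GAMMA * np.max(q[s2])
    return out

def bellman_residual(q):
    # zero exactly when Q(s,a) = R(s) + gamma * max_a' Q(s',a') for all (s,a)
    return float(np.max(np.abs(q - backup(q))))

q = np.zeros((2, 2))          # a candidate far from the solution
for _ in range(10):           # "searching" by repeatedly applying the backup
    q = backup(q)
```

Repeatedly applying `backup` drives the residual to zero, which is what the iterative procedure above tries to achieve through gradient steps on (w, b).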

And the question I raised in my previous comment is that introducing the soft update makes the situation more complicated. The reason we introduce it, per the lecture, is as follows (transcript attached):

> If you train a new neural network Q_new, maybe just by chance it is not a very good neural network; maybe it's even a little bit worse than the old one. Then you would just overwrite your Q function with a potentially worse, noisy neural network. The soft update method helps prevent Q_new from getting worse through just one unlucky step.
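For reference, the soft update the transcript describes replaces the overwrite w ← w_new with a blend. A minimal sketch (τ = 0.01 and the weight values are just illustrative numbers I made up):

```python
import numpy as np

TAU = 0.01  # illustrative; a small TAU means the target network moves slowly

def hard_update(w_target, w_new):
    # overwrite: one unlucky w_new fully replaces the Q function
    return w_new.copy()

def soft_update(w_target, w_new, tau=TAU):
    # blend: the target keeps most of its old value, absorbing w_new gradually
    return tau * w_new + (1.0 - tau) * w_target

w_target = np.array([1.0, 2.0])
w_bad = np.array([100.0, -100.0])   # a hypothetical "unlucky" training result
after_hard = hard_update(w_target, w_bad)
after_soft = soft_update(w_target, w_bad)
```

With τ = 0.01, a single bad w_new moves the target only about 1% of the way toward it, instead of replacing it outright.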

Actually, I don't entirely agree with this, because we can still reduce the impact of a bad update as long as the learning rate is kept under control. But if the argument is that setting the learning rate too small delays progress significantly, so we need another way to handle bad updates, and that is why the soft update strategy is introduced, then I can totally agree with the strategy once you can prove:

- The training set generated by the TQN helps counteract bad updates, so we don't need to reduce the learning rate α.
- The y values generated by the TQN are better than those generated by the algorithm listed above, and they help us better search (w, b) to solve the Bellman equation.

Not sure if this makes sense; if these two points are proved, my Q3 is also resolved.