I'm not clear on the concrete difference between a soft update and a normal update.

Here is a screenshot taken from the lecture Algorithm refinement: Mini-batch and soft updates (optional).

Given the formulas

  1. W = 0.01*Wnew + 0.99*W
  2. Wnew = W - alpha*(dJ/dW)

if formula 2 is substituted for Wnew in formula 1, it becomes W = 0.01*(W - alpha*(dJ/dW)) + 0.99*W, so W = W - 0.01*alpha*(dJ/dW). If this is the case, then what is the concrete difference between it and formula 2, besides the parameter alpha shrinking 100 times?

This is incorrect.

It will be Wnew = Wnew - alpha*(dJ/dW).

Why would it be Wnew = Wnew - alpha*(dJ/dW)? Isn't the new weight calculated as the old weight minus the learning rate times the derivative term?

In this part of the equation, the old weights are the Wnew that was calculated at a previous time step.
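In symbols (the subscripts are mine, added for clarity; they are not in the lecture's slides), the pair of updates is:

```latex
W_{\text{new}} \leftarrow W_{\text{new}} - \alpha\,\frac{\partial J}{\partial W_{\text{new}}},
\qquad
W \leftarrow 0.01\, W_{\text{new}} + 0.99\, W
```

Because the gradient is evaluated at W_new (the Q-Network's weights) rather than at W (the Target QN's weights), the two updates can only be collapsed into W = W - 0.01*alpha*(dJ/dW) while the two sets of weights are still equal, i.e. at the very first step.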

Hi @Feihong_YANG,

Take a look at the example below, where we focus on exactly one of the weights in the Q-Network (and the same weight in the Target QN):

Let's say that weight is initialized to 3 (in step 0). The first two rows are the updates for the QN and the TQN, and the third row is for an imaginary QN that works in the way you have described. Let's see what happens.

In step 1, because the values of the weights are the same for both the QN and the imaginary QN, we get the same gradient values (of 2, in green), and consequently both the TQN and the imaginary QN are updated to the same value, as you pointed out in your first post of this thread.

However, in step 2, because the values of the weights are no longer the same, their gradients are no longer guaranteed to be the same, causing the TQN and the imaginary QN to go down different paths.

By now, I am sure you can see that simply replacing the TQN with the Imaginary QN won’t get us the same result.

We also can't replace the two networks (QN and TQN) with one network (just the imaginary QN), because we need two networks for two different roles. (For the roles, I will refer you back to the lectures rather than repeating them here.)
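To make the divergence concrete, the table's logic can be reproduced with a few lines of Python (a minimal sketch; the toy loss J(w) = w**2 and the numbers ALPHA = 0.1, TAU = 0.01 are my own assumptions, not the lecture's):

```python
# A toy 1-D loss of my own choosing (not the lecture's network): J(w) = w**2,
# so dJ/dw = 2*w.  We track three weights, all starting at 3 as in the example:
#   w_q : the Q-Network, trained with learning rate ALPHA
#   w_t : the Target QN, soft-updated with TAU
#   w_i : the "imaginary" QN, trained directly with rate TAU * ALPHA
ALPHA, TAU = 0.1, 0.01

def grad(w):
    return 2.0 * w  # gradient of J(w) = w**2

w_q = w_t = w_i = 3.0   # step 0: everything initialized to the same value
for step in (1, 2, 3):
    w_q = w_q - ALPHA * grad(w_q)        # normal gradient step on the QN
    w_t = TAU * w_q + (1 - TAU) * w_t    # soft update of the Target QN
    w_i = w_i - TAU * ALPHA * grad(w_i)  # the collapsed "imaginary" update
    print(step, round(w_t, 6), round(w_i, 6))
```

In step 1 both printed values are 2.994, exactly as argued above; from step 2 onward they differ, because the QN's gradient is evaluated at the QN's own weights while the imaginary QN's gradient is evaluated at its (now different) weights.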


Hey @rmwkwok

Much appreciation for this screenshot example; it resolves my confusion about the difference between the soft update and the imaginary solution. I am not so sure, but I think it shares a similar insight with the momentum idea introduced in the Deep Learning Specialization, which gradually reduces the impact of previous updates and carefully adds in new experience.

But I still have two concerns regarding the DQN solution.

  1. How can we prove that the soft update outperforms the imaginary QN after maybe thousands or millions of steps, if we set the learning rate small enough, as in your imaginary QN example, given that we already update carefully (small learning rate)?

  2. This is a more general concern about the solution itself. I noticed that the general algorithm proceeds like this: the QN directs the agent to explore the environment and generate a batch of records, and the TQN is responsible for building the training set with the Bellman equation. After that, we train our QN on the training set generated by the TQN and compute the loss for the weight update. But how can we be sure this makes sense? Or what is the philosophy behind it that leads us to believe it will give a good result? I can only get some insight from the Bellman equation itself, Q(s,a) = R(s) + gamma*max_a' Q(s',a'): when we generate the training set we have R(s), which is a ground-truth value, so the target is believed to be more reliable than a prediction from QN(s,a), and we believe the training set's target values can serve as evidence for training.
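If I read the procedure correctly, the TQN's role in building the training set can be sketched like this (my own minimal numpy version; the names and numbers are made up for illustration, not taken from the assignment):

```python
import numpy as np

GAMMA = 0.995  # an assumed discount factor

def build_targets(rewards, next_q_values, dones):
    """y = R(s) + gamma * max_a' Q_target(s', a'), with y = R(s) at terminal steps."""
    return rewards + GAMMA * (1.0 - dones) * next_q_values.max(axis=1)

# Three experience tuples from the replay buffer (made-up numbers):
rewards = np.array([1.0, -0.5, 100.0])
next_q  = np.array([[0.2, 0.8],      # the TQN's Q(s', a') for each action
                    [0.1, 0.3],
                    [0.0, 0.0]])
dones   = np.array([0.0, 0.0, 1.0])  # the third transition is terminal

print(build_targets(rewards, next_q, dones))
```

The only ground-truth value entering y is R(s); the rest is bootstrapped from the TQN's own predictions, which is exactly why it matters which network supplies Q(s', a').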


Before discussing your questions, I actually have a few questions of my own, and I want to hear your opinion. If you need to review the lectures (for Q1 and Q2), please take your time. Q3 definitely has nothing to do with the lectures, but through Q3 I should be able to see why you think the imaginary QN is a useful network.

  1. What relation do we have between the QN and the TQN?
  2. How does that relation help? (Or why is that relation useful?)
  3. What relation do we have between the QN and the imaginary QN?

Look forward to your reply.



I have just gone back to the lectures and reviewed the assignment code. Here is my understanding.

  1. At the beginning, the QN and the TQN are set to the same parameters.
  2. During the experience-generating process, the QN is responsible for driving the agent's exploration decision for the next step in the environment.
  3. Once the QN has finished exploring and the experience records are generated, the TQN is responsible for building the training set, calculating the y values per the Bellman equation, and letting the QN train on it. The trained result is applied to the TQN with the soft update strategy.
  4. The process is like this: the QN is responsible for exploring the environment and trying to pick up some experience. Since the target value y is generated with the TQN for the QN to learn from, the TQN's role here looks a little like a teacher's, and the QN is the student. After the student has learnt something from the training set built by the teacher, the teacher carefully absorbs this experience into its own knowledge base (the soft update), and then teaches with the updated knowledge base in the follow-up learning process.
  5. The imaginary QN generates the training set by itself; I am not sure whether it is good to describe that as a kind of self-study behavior.

This is purely my own extended understanding, beyond the intuition provided in the lecture that the purpose of the soft update is to make the learning process steadier and not let a worse update have too much impact on the current, potentially better, network.
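The soft update in point 3 can be sketched in a few lines (a toy version with weights as plain Python floats, just to show the formula; a real network would apply this to every layer's parameters):

```python
TAU = 0.01  # the 0.01 / 0.99 split from the formula earlier in the thread

def soft_update(tqn_weights, qn_weights, tau=TAU):
    """W_tqn <- tau * W_qn + (1 - tau) * W_tqn, applied element-wise."""
    return [tau * wq + (1.0 - tau) * wt
            for wt, wq in zip(tqn_weights, qn_weights)]

# After one training step the QN has moved, but the TQN only inches toward it:
updated = soft_update(tqn_weights=[3.0, -1.0], qn_weights=[2.4, -0.5])
print(updated)  # each TQN weight moves only 1% of the way toward the QN's value
```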

Hello @Feihong_YANG,

Thank you for your feedback! You have basically given a description of how QN and the TQN work together to learn from the environment, and you have shared that they have a teacher-student-like relationship. A great explanation it is!

As a continuation of your explanation, I would like to highlight another aspect of their relation as follows:

  1. The TQN also learns from the QN (because the TQN's weights are updated according to the QN's weights through the soft update equation, which you have also mentioned)

  2. QN learns to follow TQN’s y (as you have pointed out)

Therefore, compared to QN + TQN, the QN + IQN (IQN: imaginary QN) combination lacks that bi-directional interaction. While the IQN can still teach the QN, the IQN doesn't learn from the QN. Here is the problem: where would you expect the IQN to get its y values from, without any TQN?



That's a great highlight; it makes the description more succinct!
Let me see if I can get your point:

While the IQN can still teach the QN, the IQN doesn't learn from the QN. Here is the problem: where would you expect the IQN to get its y values from, without any TQN?

Can I understand it like this: the TQN learns from the QN through the soft update, but the IQN is just a copy of the QN? Since if we set W = 1*Wnew + 0*W, then Q = Qnew.

Actually, I deduced this teacher-student schema just from the learning procedure itself, and it is a little hard for me to see the TQN outperforming the IQN. Even though in general the TQN takes more account of its previous states, this teacher-student schema is valid only if we can guarantee that accumulating previous exploration makes it seasoned enough to be a teacher, rather than worse than the simple IQN.

I was trying to understand it with the momentum schema; I am not sure whether that can provide any insight into how we can guarantee this. Is there any hint?

Actually, the IQN is easier for me to understand, since per the Bellman equation Q(s,a) = R(s) + gamma*max_a' Q(s',a'), if we can find a Q function that fits this equation then the problem is solved, so we can simply treat it as a search problem.

I doubt it. To me, your IQN is just another QN but with a smaller learning rate. Am I right? So the QN is a faster learner and the IQN is a slower learner. The QN's y target is obtained from the IQN (as I guess this is how you think it should work), but it is still unclear where the IQN is going to get its y. Any idea about this part? I am afraid we need to face this question directly before we continue to discuss this IQN idea. So, where does the IQN get its y values for training?

You have illustrated what the weight update formula should look like for both the QN and the IQN. I think now we need to know where they get their y values.

In the current QN + TQN scheme, the QN gets its y values using the TQN, and the TQN doesn't require any y value because it updates through the soft update formula.

Now, what about the QN + IQN thing?

Because if we cannot figure out how the IQN gets its y values, then the IQN won't work, and there will be no need to discuss whether the IQN will outperform anything, right? If we cannot figure that out, we don't know what the IQN is supposed to learn to be, right?

Hi @rmwkwok
Correct me if I got it wrong: the idea of the IQN, in my opinion, was introduced from the screenshot example where we compare how the calculation would differ as the process goes on and reaches step 3 and beyond. I think it is just the procedure of the algorithm introduced before soft updates were discussed, i.e. the one described in the lecture Learning the state-value function; and in order to describe my Q1, I made the learning rate small, which means

To me, your IQN is just another QN but with a smaller learning rate.

I agree.

For this,

To me, your IQN is just another QN but with a smaller learning rate.

If your QN here refers to the algorithm discussed in the linked lecture, that's right. To confirm we are talking about the same algorithm, let me list the procedure here, drop the term IQN, and use numbers to label the network in its different states:

  1. Initialize w and b for the QN; call it QN_0.
  2. Explore, driving the agent using QN_0 with an ϵ-greedy policy to generate a list of experiences.
  3. Build the training set, labelling the y values using QN_0 with the Bellman equation.
  4. Train QN_0 on the generated training set; the weights are updated, so we get QN_1.
  5. Use QN_1 and loop back to 2; keep up the procedure so we get QN_2, QN_3, ...

To me, in the whole procedure described here, the key point is not how we generate y but how we can get a QN that makes the Bellman equation valid. Once the equation is valid, we can say the learning problem is solved and the model we get is exactly what we want. Thus it is easier for me to understand it as a search problem: we are searching the weight space (w, b) to make the equation Q(s,a) = R(s) + gamma*max_a' Q(s',a') valid, and this is why I think the IQN makes sense, since every time it makes an adjustment toward that goal.
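As a sanity check of this "search problem" view, here is a one-number caricature of steps 1-5 (my own toy, not the assignment: one state, one action, reward 1, gamma = 0.9, and the whole Q function is a single float; the label y is held constant during each gradient step, as in the lectures):

```python
GAMMA, LR = 0.9, 0.5

q = 0.0                         # "QN_0": a one-parameter Q function
for _ in range(50):
    y = 1.0 + GAMMA * q         # step 3: label y using the current network itself
    q = q - LR * 2.0 * (q - y)  # step 4: one gradient step on the loss (q - y)**2
print(q)  # approaches the Bellman fixed point Q* = 1 / (1 - 0.9) = 10
```

In this noise-free toy, the self-labelling loop does converge, which is why the scheme looks plausible; the open question in this thread is what happens when y is a noisy estimate from a real network, which is where the soft update is claimed to help.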

And the question I raised in my previous comment is that, since we introduced the soft update, the situation seems more complicated. As for the reason we introduce it, per the lecture, I attach the transcript here:

If you train a new neural network Q_new, maybe just by chance it is not a very good neural network; maybe it is even a little bit worse than the old one. Then you would have just overwritten your Q function with a potentially worse, noisy neural network. The soft update method helps prevent Q from getting worse through just one unlucky step.

Actually, I don't quite agree with this, because we can still reduce the impact of a worse update as long as the learning rate is under control. But if you say that a learning rate set too small delays progress significantly, so we have to come up with another solution to address worse updates, and that is why we introduce the soft update strategy, then I can totally agree with the strategy once you can prove:

  1. The training set generated by the TQN can help counteract a worse update, so we don't need to reduce the learning rate alpha.
  2. The y values generated by the TQN are better than the ones generated by the algorithm listed above, and they help us better search (w, b) to solve the Bellman equation.

I am not sure whether this makes sense, since this would also resolve my Q3 once proved.

Hello @Feihong_YANG

Do you think this is equivalent to setting \tau = 1? Recall the soft update formula, W = \tau*Wnew + (1 - \tau)*W.

So, if we have \tau = 1, then we make TQN = QN at every soft update.

Of course, besides \tau = 1, according to you, we also need a small learning rate for training the QN. Currently \tau = 0.001 and \alpha = 0.001.

If setting \tau = 1 and \alpha = 0.000001 is sufficient, then you may prove your idea by modifying those two parameters in the Course 3 Week 3 assignment and seeing how your lunar lander behaves. Just two quick changes.
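For reference, the two quick changes would look something like this in the notebook's hyperparameter cell (the constant names here are my guess from memory; match them to whatever your copy of the notebook actually defines):

```python
TAU = 1.0     # was 1e-3; with tau = 1 every soft update is a full copy, so TQN == QN
ALPHA = 1e-6  # was 1e-3; the very small learning rate of the proposed single-network scheme
```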

I understand that in your last reply you are looking for some proof that QN + TQN is better than your idea, but I hope you will be the one to prove that your idea can work, at least on the lunar lander.


Hello @Feihong_YANG,

Because I am serious about your idea and I wanted to give it a try, I set \tau = 1 and \alpha = 0.000001 and kept everything else unchanged. It took less than 20 minutes for the whole notebook to finish running.

Below is the video generated in the last step of the assignment; the lunar lander couldn't land properly.


Also, throughout the training process, it never achieved a positive total point average.

If you want to defend your idea, you may tell me how to modify the assignment's parameters to train a lunar lander that works.


@Feihong_YANG, I believe that, according to you, both the QN + TQN approach and your idea can solve the problem, because both of them use a valid equation: the Bellman equation. I guess you just don't think that the TQN is necessary, and that without the TQN things become simpler and you can understand them better.

Therefore, I hope you will run experiments on the assignment notebook and show both of us how getting rid of a separate TQN (by setting \tau = 1) can still give you a working lunar lander. My first attempt to do so failed.

However, if your experiments convince you that the TQN is necessary, then we can focus our discussion on how the TQN may help.


@Feihong_YANG, I believe that, according to you, both the QN + TQN approach and your idea can solve the problem, because both of them use a valid equation: the Bellman equation. I guess you just don't think that the TQN is necessary, and that without the TQN things become simpler and you can understand them better.

Hi @rmwkwok
Actually, I would prefer to say I am a little lost about the philosophy behind the idea of the TQN, rather than that I don't think the TQN is necessary; it just doesn't look very intuitive.

I did try to run this experiment. Given that I have already completed this specialization, I no longer have access to the notebook, so I downloaded the code from the website this afternoon and tried to run it, but failed due to a package issue. Could you help by just running it with a normal alpha value for the Adam optimizer, maybe the same as the one used with the TQN in the assignment, 1e-3, and see if it converges within 600 episodes? An extremely small alpha value is not required if we accept the risk of a worse impact from any mini-batch sample. If it doesn't beat the TQN, then we can say that at least the experiment shows the TQN is a better strategy.

Hello @Feihong_YANG,

Data Science is, to me, a practical science. I am sorry, but I think it has to be you who does the experiments, because only then can you observe and try to come up with explanations. Also, even if you are able to come up with one set of parameters that happens to train a working lunar lander, does just one success prove everything? Won't you then need to conduct more experiments?

If you just think the TQN is not intuitive, then we can try to discuss the TQN. However, if you propose a different way, it is better that you have experimented with it.

Shall we just talk about the TQN? Besides the lectures, did you read anything else you could find on the internet about the TQN and soft updates? Maybe some online discussions or some short articles?


For example, generally speaking, our robot might have to explore a very large state space, so large that it is impossible for the memory buffer to remember every single step the robot has explored. Since each time our QN is trained on a limited and recent memory, without a TQN (which is resistant to recent changes) the QN can keep being refreshed toward recent states and very soon forget the old past. What do you think?
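That limited memory is easy to picture with a minimal buffer sketch (my own simplification using a bounded deque; the assignment's buffer is more elaborate):

```python
import random
from collections import deque

class ReplayBuffer:
    """A fixed-capacity experience memory: old steps silently fall out."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, experience):
        self.buffer.append(experience)  # evicts the oldest item when full

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=3)
for step in range(10):                 # the robot explores 10 steps...
    buf.add(("state", step))
print(list(buf.buffer))                # ...but only the 3 most recent survive
```

Training only on what this buffer currently holds is why, without a slow-moving TQN, the targets can drift with whatever the buffer happens to contain.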


Thanks for providing this insight. This is exactly my first intuition while understanding this algorithm, and I posted:

this teacher-student schema is valid only if we can guarantee that accumulating previous exploration makes it seasoned enough to be a teacher, rather than worse than the simple IQN.

And I agree that it is a practical science. Maybe even the author of the TQN algorithm could only guarantee it through experiments, rather than through some kind of mathematical proof that it outperforms the IQN. I will try to rebuild my messy local environment again and see if I can run it.

Sure. When everything is set up and ready, you might test that idea by adjusting the memory buffer size. Perhaps see how \tau has to change with the size?