In the lecture linked above, it is implied that the randomly initialized parameters of the neural network will improve over time, even though the y-value Q(s,a) of the training examples on which the neural network is trained at each iteration is approximated using the outputs of that neural network.

How is it possible that you can improve the predictions of a neural network using an artificial training set made from that very neural network’s predictions?

As long as the output of the NN and its ground truth (which may itself be an approximation, but is not exactly the same as the output) are different, the NN will keep pushing to minimize this difference!

Unfortunately I did not take the ML course here, so I do not have access to the direct video you are referencing.

But even before deep learning, in traditional ML, the idea that you in some way ‘retrain the model’ (or train a second model, then ensemble) on the instances the first one got wrong is really not all that revolutionary or new. But, as always, the really big and key thing in the end is to avoid overfitting.

Personally, I think this is perhaps one of the cruxes of latest model development.

But the basic idea is not all that wrong. If you practice for a quiz (like an IRL quiz) and get some questions incorrect (or, more strictly, you train on the errors), aren’t those the same types of problems you want to practice more on (but hopefully not the exact same ones)?

Firstly, an important fact to always keep in mind is that we don’t train the model on just its own outputs, but also on the rewards R. The outputs might be random and wrong, but the rewards are not.

Secondly, an interesting property we can observe from the Bellman equation is about the importance of the reward R:

If you repeat the substitution many times, then in the resulting equation the R terms make up everything except the last, heavily “penalized” Q term. This means that if the network is capable of learning the correct R(s) values in the different states s along a state path, then, because of the decaying factor \gamma^n, the error contributed by the last term is not going to be significant.
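To make the substitution concrete, here is a sketch of the unrolled equation. It assumes the form of the Bellman equation discussed in this thread, Q(s,a) = R(s) + \gamma \max_{a'}Q(s',a'), with deterministic transitions, and it uses s_0, s_1, \dots, s_n as my own labels for the states visited when the best action is taken at each step:

```latex
\begin{aligned}
Q(s_0,a_0) &= R(s_0) + \gamma \max_{a_1} Q(s_1,a_1) \\
           &= R(s_0) + \gamma R(s_1) + \gamma^2 \max_{a_2} Q(s_2,a_2) \\
           &\;\;\vdots \\
           &= \sum_{k=0}^{n-1} \gamma^{k} R(s_k) + \gamma^{n} \max_{a_n} Q(s_n,a_n)
\end{aligned}
```

Every term in the sum is a known reward. Only the final term still relies on the network's own guess, and it is scaled down by \gamma^n.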

In other words, the rewards matter because they should change the outputs of the model from “some random values” to “rewards + some random values”. This change is significant.
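Here is a small sketch of that effect. To keep it short, it swaps the neural network for a plain lookup table and uses a made-up 5-state corridor, so the states, rewards, and discount factor below are all assumptions for illustration, not the course's setup. The point is only that targets of the form “reward + discounted current guess” pull two different random starting tables toward the same values.

```python
import numpy as np

# Hypothetical toy problem: 5 states in a row, actions 0 = left, 1 = right.
# States 0 and 4 are terminal; only state 4 carries a reward.
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9
REWARD = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
TERMINAL = np.array([True, False, False, False, True])


def step(s, a):
    """Deterministic transition: move one cell left or right."""
    return s - 1 if a == 0 else s + 1


def bellman_sweep(q):
    """Rebuild every Q(s, a) from the known reward plus the current guess."""
    new_q = np.empty_like(q)
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            if TERMINAL[s]:
                new_q[s, a] = REWARD[s]
            else:
                new_q[s, a] = REWARD[s] + GAMMA * q[step(s, a)].max()
    return new_q


def run(seed, sweeps=100):
    """Start from random guesses and apply the update repeatedly."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(N_STATES, N_ACTIONS))  # random initial table
    for _ in range(sweeps):
        q = bellman_sweep(q)
    return q


q_a = run(seed=0)
q_b = run(seed=1)
print(np.round(q_a, 3))
print(np.allclose(q_a, q_b))
```

The two runs start from unrelated random tables, yet the rewards anchor both of them. A real DQN adds function approximation and sampling noise on top of this, so it behaves less cleanly, but the anchoring role of R is the same.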

So, for instance, when we are creating an artificial training example from the tuple (s^{(1)},R(s^{(1)}),a^{(1)},s'^{(1)}) to train the DQN on, we don’t just simply approximate the training example’s y-value, Q(s^{(1)},a^{(1)}), as the predicted output of a randomly initialized DQN, but rather approximate it as y^{(1)} = R(s^{(1)}) + \text{the DQN's random guess for } \max_{a'}Q(s',a'), given that we actually know the value of R(s^{(1)})?

Sorry for getting back late, but yes, your last statement is a perfect description of what’s going on, and you will see (or you might have seen) the same thing in the week’s assignment.

Great, thanks for pointing that out! I also forgot to multiply \max_{a'}Q(s',a') by the discount factor \gamma, given that we know its value as well.

Oh! I thought you had absorbed the gamma there. To be more precise, it is a random guess only when the DQN is at its initial stage. As we keep training the model, its predictions will progressively become more and more informed.
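For anyone reading later, here is a minimal sketch of the target computation agreed on above, y = R(s) + \gamma \max_{a'}Q(s',a'), for a small batch. The “network” is just one random linear layer standing in for an untrained DQN, and the names `random_q_network` and `build_targets`, the value of `GAMMA`, and the batch contents are all hypothetical, not taken from the assignment.

```python
import numpy as np

GAMMA = 0.995  # assumed discount factor, chosen only for illustration


def random_q_network(state_dim, n_actions, rng):
    """Hypothetical stand-in for an untrained DQN: one random linear layer."""
    weights = rng.normal(scale=0.1, size=(state_dim, n_actions))
    return lambda states: states @ weights


def build_targets(q_fn, rewards, next_states, gamma=GAMMA):
    """Compute y = R(s) + gamma * max_a' Q(s', a') for a batch of tuples."""
    max_next_q = q_fn(next_states).max(axis=1)  # the network's own guess
    return rewards + gamma * max_next_q         # known reward + discounted guess


rng = np.random.default_rng(0)
q_fn = random_q_network(state_dim=4, n_actions=2, rng=rng)

rewards = np.array([1.0, 0.0, -1.0])     # known exactly from the environment
next_states = rng.normal(size=(3, 4))    # made-up s' values
targets = build_targets(q_fn, rewards, next_states)
print(targets)
```

Even with random weights, each target differs from the raw guess by the true reward, and that difference is what the training step learns from. This sketch leaves out terminal states, where the target would normally be just the reward.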