Question about DQN learning

benmore · July 4, 2024, 1:39am

Learning the state-value function | Coursera

In the lecture linked above, it is implied that the randomly initialized parameters of the neural network will improve over time, even though the y-value Q(s,a) of the training examples on which the neural network is trained at each iteration is approximated using the outputs of that neural network.

How is it possible that you can improve the predictions of a neural network using an artificial training set made from that very neural network’s predictions?

TMosh · July 4, 2024, 5:23am

It seems like magic, and it pretty much is. I don’t pretend to understand how it works in any detail.

But it’s guided by the fact that there is a measured goal to achieve, so the cost can be minimized.

gent.spah · July 4, 2024, 8:57am

As long as the output of the NN and its ground truth (which may be only an approximation of the correct output, not the exact value) are different, the NN will try to minimize this difference!

benmore · July 5, 2024, 2:18am

Thanks for the responses—I’ll look into this further.

Nevermnd · July 5, 2024, 2:50am

Unfortunately I did not take the ML course here, so I do not have access to the direct video you are referencing.

But even in traditional ML, which predates deep learning, the idea that you in some way ‘retrain the model’ (or train a second model and then ensemble the two) on the instances the first one got wrong is not all that revolutionary or new. As always, though, the key thing in the end is to avoid overfitting.

Personally, I think this is perhaps one of the cruxes of the latest model development.

But the basic idea is not all that wrong. If you practice for a quiz (like an IRL quiz) and get some questions incorrect (or, more strictly, you train on the errors), aren’t those the same types of problems you want to practice more on (but hopefully not the exact same ones)?

rmwkwok · July 5, 2024, 8:38pm

Hello, @benmore,

Firstly, an important fact to always keep in mind is, we don’t train the model on just its own outputs, but also the rewards R. The outputs might be random and wrong, but the rewards are not.

Secondly, the Bellman equation, Q(s,a) = R(s) + \gamma \max_{a'}Q(s',a'), has an interesting property that shows the importance of the reward R:

If you repeatedly substitute the equation into its own Q term on the right-hand side, the R terms make up every term of the resulting equation except the last one, a heavily “penalized” Q term. This means that if the network is able to learn the correct R(s) values for the different states s along a state path, the error contributed by the last term, which is scaled by the decay factor \gamma^n, will not be significant.
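
For reference, here is a sketch of that substitution written out. It assumes the course's form of the Bellman equation and that the greedy action is taken at every step along a path s_0, s_1, …, s_n; the subscripts are notation introduced here for illustration.

$$
\begin{aligned}
Q(s_0,a_0) &= R(s_0) + \gamma \max_{a_1} Q(s_1,a_1) \\
           &= R(s_0) + \gamma R(s_1) + \gamma^2 \max_{a_2} Q(s_2,a_2) \\
           &\;\;\vdots \\
           &= \sum_{k=0}^{n-1} \gamma^k R(s_k) + \gamma^n \max_{a_n} Q(s_n,a_n)
\end{aligned}
$$

Only the last term depends on the network's own guess, and it is scaled by \gamma^n. As an illustrative number, with \gamma = 0.9 and n = 50 that factor is about 0.005.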

In other words, the rewards matter because they should change the outputs of the model from “some random values” to “rewards + some random values”. This change is significant.

What do you think?

Cheers,
Raymond

benmore · July 7, 2024, 11:26am

So, for instance, when we are creating an artificial training example from the tuple (s^{(1)},R(s^{(1)}),a^{(1)},s'^{(1)}) to train the DQN on, we don’t just simply approximate the training example’s y-value, Q(s^{(1)},a^{(1)}), as the predicted output of a randomly initialized DQN, but rather approximate it as y^{(1)} = R(s^{(1)}) + \text{the DQN's random guess for } \max_{a'}Q(s',a'), given that we actually know the value of R(s^{(1)})?

rmwkwok · July 12, 2024, 2:37am

Hey, @benmore,

Sorry for getting back late, but yes, your last statement is a perfect description of what’s going on, and you will see (or you might have seen) the same thing in the week’s assignment.

Cheers,
Raymond

benmore · July 13, 2024, 5:34am

Great, thanks for pointing that out! I also forgot to multiply \max_{a'}Q(s',a') by the discount factor \gamma, whose value we also know.
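
A minimal sketch of how such a target could be computed, assuming NumPy and a batch of the network's Q-value predictions for the next states. The function name `td_targets`, the optional `done` flag for terminal transitions, and the sample numbers are assumptions made for illustration; they are not taken from the lecture or the assignment.

```python
import numpy as np

def td_targets(rewards, next_q_values, gamma, done=None):
    """Build y = R(s) + gamma * max_a' Q(s', a') for a batch of transitions.

    rewards:        shape (batch,), the known rewards R(s)
    next_q_values:  shape (batch, n_actions), the network's guesses for Q(s', .)
    done:           optional shape (batch,), 1.0 where s' is terminal (assumption:
                    for those transitions the target is just R(s))
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    if done is None:
        done = np.zeros_like(rewards)
    done = np.asarray(done, dtype=float)

    # The only part of the target that comes from the network itself.
    max_next_q = next_q_values.max(axis=1)

    # Known reward plus the discounted guess (dropped for terminal transitions).
    return rewards + gamma * max_next_q * (1.0 - done)

# Hypothetical batch of two transitions with two actions each.
rewards = [1.0, 0.0]
next_q = [[0.2, -0.5],
          [0.1, 0.4]]
y = td_targets(rewards, next_q, gamma=0.5)
print(y)
```

In this sample the targets are 1 + 0.5 \cdot 0.2 and 0 + 0.5 \cdot 0.4: the reward part is exact, and only the discounted part relies on the network's current guess.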

rmwkwok · July 13, 2024, 7:10am

Oh! I thought you had absorbed the gamma there. To be more precise, it is a random guess only when the DQN is at its initial stage. As we keep training the model, its predictions will become progressively more informed.
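
To see the "progressively more informed" part in a toy setting, here is a small sketch that swaps the neural network for a plain Q-table, so that "training" simply overwrites each entry with its target. The 5-state chain, its rewards, \gamma = 0.9, and all names are hypothetical; this is not the Lunar Lander setup from the course.

```python
import numpy as np

GAMMA = 0.9
N_STATES, N_ACTIONS = 5, 2            # actions: 0 = left, 1 = right
REWARDS = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
TERMINAL = 4                          # reaching state 4 ends the episode

def next_state(s, a):
    """Deterministic move along the chain, clipped at both ends."""
    return max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)

def bellman_update(q):
    """Overwrite every Q(s, a) with its target R(s) + gamma * max_a' Q(s', a')."""
    new_q = np.empty_like(q)
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            if s == TERMINAL:
                new_q[s, a] = REWARDS[s]          # no future term at the end
            else:
                new_q[s, a] = REWARDS[s] + GAMMA * q[next_state(s, a)].max()
    return new_q

def run(seed, n_iters=200):
    """Start from random Q-values and repeatedly train on self-made targets."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(N_STATES, N_ACTIONS))    # the "random guess" stage
    for _ in range(n_iters):
        q = bellman_update(q)
    return q

q_a = run(seed=0)
q_b = run(seed=1)
print(np.round(q_a, 4))
print("max difference between the two runs:", np.abs(q_a - q_b).max())
```

Two different random starting tables end up at essentially the same values, because every update mixes in the true rewards and shrinks the leftover guess by a factor of \gamma. A real DQN only approximates this with gradient steps on sampled transitions, so the same guarantee does not carry over directly.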
