# Confusion regarding basic mathematics of DQN Algorithm

In reinforcement learning, the return of a state-action pair is calculated using the Bellman equation:

Q(s, a) = R(s) + γ · max_a' Q(s', a')

The above expression tells us that to calculate the return ‘Q’ for a state-action pair, we have to calculate two parts, i.e., the ‘reward you get right away’ and the ‘return from behaving optimally afterwards’.

The first part, i.e., the ‘reward you get right away’, can be calculated from the reward function shown below:

The ‘reward you get right away’ is the reward of the state that you are presently in. Since we have an arbitrary state, we can check from the lunar lander simulation whether we have a leg grounded, whether we have crashed, or whether we have landed, and thus we can compare our arbitrary state against the above reward function to get ‘R(s)’, i.e., the ‘reward you get right away’.
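As a sketch of this first part, reading off R(s) from the current state might look like the following. The flags and numeric values here are illustrative placeholders, not the Lunar Lander environment’s actual reward function:

```python
# Hypothetical sketch of reading off R(s), the 'reward you get right away',
# from the current state. The flags and numeric values are illustrative
# placeholders, not the Lunar Lander environment's actual reward function.

def immediate_reward(state):
    """Return R(s) for the situation the lander is presently in."""
    if state.get("crashed"):
        return -100.0   # illustrative crash penalty
    if state.get("landed"):
        return 100.0    # illustrative landing bonus
    if state.get("leg_grounded"):
        return 10.0     # illustrative leg-contact bonus
    return 0.0          # nothing notable happened at this step

print(immediate_reward({"leg_grounded": True}))  # 10.0
```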

As for the second part i.e., ‘return from behaving optimally afterwards’, we don’t have a defined policy. So, how would we calculate the second part without a policy?

To compute it in the above way, we need to know the values of Q for each pair of s and a in advance. The lecture’s simple Mars Rover example should have demonstrated that.

In the more complicated Lunar Lander example, Q is not known in advance; in fact, Q is the very thing we need to solve for. The steps should be:

1. Learn a Q(s, a) function given the Reward function (that you showed) and the Bellman equation (that you also showed)

2. After the learning, compute Q(s, a) right away, without the Bellman equation and without breaking it into two steps.

For how to learn the Q function, check out this lecture and the assignment.
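As an illustration of step 2, after training, the learned Q function is used directly for action selection. The Q values below are made up and stand in for a trained network’s outputs for one state:

```python
import numpy as np

# Sketch of step 2 above: after training, Q(s, a) is used directly, with no
# Bellman decomposition and no two-part calculation. The Q values below are
# made up and stand in for a trained network's outputs for one state.

def greedy_action(q_values):
    """Pick the action with the highest estimated return Q(s, a)."""
    return int(np.argmax(q_values))

q_of_s = np.array([1.2, 3.4, 0.5, 2.0])  # one Q value per action
print(greedy_action(q_of_s))  # 1  (the second action has the highest Q)
```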

Cheers,
Raymond

However, you explained how the ‘Q-Network’ in the lectures and the respective notebook is trained.

My question was based on the ‘Target Q-Network’ in the final Jupyter Notebook of the reinforcement learning week 3 (Notebook Name is “C3_W3_A1_Assignment”).

I don’t understand why a ‘Target Q-Network’ is needed from a conceptual standpoint.

Every neural network ‘estimates’ a value. In the above-referenced notebook, we are using a ‘Target Q-Network’ to output Q(s,a), and this Q(s,a) is being treated as the true-Y values.

This true-Y value will be used to train the main Q-Network that will predict Q(s,a) for given state-action pair.

Thus, it is actually our true-Y values that play the central role in training the main Q-Network, which will ultimately be used for predicting Q(s,a) without directly using the Bellman equation.

My confusion is regarding using a ‘Target Q-Network’ to output true-Y values. True values are not supposed to be created; they’re supposed to come from real-world data. In our case, the simulation of the moon lander used in the notebook is our alternative to real-world data.

So, why are we using a neural network to populate true-Y values in the training set?

Keeping above arguments in view, the overall reinforcement learning algorithm in the notebook is working in the following way:

1. We use the moon lander simulation to get SARS tuples. This is an alternative to real-world data. From these SARS tuples, we’ve got our ‘x’. It is a vector of state-action pairs, i.e., x = (s,a). However, the training set is incomplete without a ‘y’.

2. We create a ‘Target Q-Network’ to generate ‘y’. This means our ‘y’ is an estimate and not actually a true value. (An actual true value of ‘y’ would require the return from behaving optimally starting from state s’, which is not possible without having a policy.)

3. We create a ‘Q-Network’ to estimate the true y-values generated above, i.e., the Q-Network is effectively ‘estimating the estimated value of y’.

Thus, I’m confused because we never used a true Q(s,a) to train our neural network, and so our Q-Network is trained to ‘estimate an estimate’.
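To make step 2 above concrete, the target-generation step can be sketched as follows. All names and numbers here are illustrative, and `next_q_values` stands in for the Target Q-Network’s outputs Q_target(s', a'):

```python
import numpy as np

# Sketch of step 2 above: the Target Q-Network generating 'y' for a
# minibatch of SARS tuples. All names and numbers are illustrative;
# next_q_values stands in for Q_target(s', a') evaluated by the network.

GAMMA = 0.995  # an assumed discount factor

def compute_targets(rewards, next_q_values, done_flags, gamma=GAMMA):
    """y = R + gamma * max_a' Q_target(s', a'); no bootstrap after episode end."""
    max_next_q = next_q_values.max(axis=1)
    return rewards + gamma * max_next_q * (1.0 - done_flags)

rewards = np.array([1.0, -100.0])
next_q_values = np.array([[0.5, 2.0],   # Q_target(s', a') for two actions
                          [3.0, 1.0]])
done_flags = np.array([0.0, 1.0])       # second transition ended the episode
targets = compute_targets(rewards, next_q_values, done_flags)
print(targets)  # first target: 1.0 + 0.995 * 2.0 = 2.99; second: -100.0
```

These ‘y’ values are then used as the regression targets when fitting the main Q-Network, which is exactly the “estimate of an estimate” being questioned here.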

Before going any further, I have a question for you: assuming that we don’t have the Target Q Network, how can we get the true target Y values [that we use for successfully training a useful Q Network]?

I am not expecting an answer like “return from behaving optimally starting from …”, because I cannot convert that sentence into my true target Y values [that we use for successfully training a useful Q Network].

I am asking, step-by-step, how can we get the true target Y values [that we use for successfully training a useful Q Network]?

If you think there is no way, because we do not know the real true Y values in advance, then it is fine for you to say there is no way. I do not know either.

If you want to do some research on why a Target Q Network can provide us some true target Y values that we use for successfully training a useful Q Network, then I would be very happy to see what you find. Learning does not stop with a question, right?

So, two things:

1. How can we get any true target Y values without the Target Q Network? And does it comply with the condition - “return from behaving optimally starting from …” - that you are requiring for the Target Q Network? However, it’s fine to say that there is no such way.

2. Any findings in your research?

Cheers,
Raymond

PS1: I keep using “that we use for successfully training a useful Q Network” to post-modify the term “true target Y values” because we know that the Target Q Network is changing too. However, how can the truth be changing? Therefore, we know that we are not expecting the Target Q Network to produce the truth during the training process, but rather useful target Y values for doing gradient descent.

PS2: From this moment on, I will not use the term “true y values”, but “target y values”. I hope you will do the same.

We are using a simulation to generate the SARS experience tuples. This means that our simulation captures the physics of the moon lander. After all, the SARS tuple itself is generated from the simulation, because the simulation is able to tell how the position of the moon lander would change if we fire a thruster in any direction.

Therefore, using the simulation, we may generate a multitude of sequences of landing the moon lander successfully from a multitude of initial states, such that the moon lander follows a straight-line path from its initial state to the landing pad.

Since the moon lander follows a straight-line path to the landing pad from wherever it was initially in space, it can very well be considered as *behaving optimally starting from …*.

Following the above idea, a single optimal sequence would contain several SARS tuples. Moreover, at the end of each sequence, we’d be able to calculate not only the immediate reward but also the return for behaving optimally starting from a state s’.

Conclusively, we’d be able to calculate Q(s,a) from the Bellman equation with actual numbers obtained from the simulation runs for both parts of the equation, i.e., the R(s) part and the γ·max Q(s’,a’) part.

Now, instead of having 10,000 SARS tuples, we could have 10,000 optimal sequences as the training data for the useful Q-Network. That would eliminate the need for a Target Q-Network altogether.
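The proposal above, computing returns directly from completed sequences, can be sketched like this (a Monte Carlo style calculation; the rewards and discount factor are made up):

```python
# Sketch of the proposal above: given one complete landing sequence, the
# return from every state can be computed backwards from the final step,
# with no Target Q-Network involved. Rewards and gamma are made up.

GAMMA = 0.5  # small illustrative discount so the numbers are easy to check

def returns_from_sequence(rewards, gamma=GAMMA):
    """G_t = r_t + gamma * G_{t+1}, accumulated from the last step backwards."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# A three-step sequence whose only reward is a final landing bonus:
print(returns_from_sequence([0.0, 0.0, 100.0]))  # [25.0, 50.0, 100.0]
```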

I’m researching it, and actually this is exactly the main thing I’m trying to understand: how are we using a neural network to generate target Y values?

It is because my understanding of a neural network is that we give it real-world data and it fits a complex n-dimensional curve/graph/pattern to that data during the learning process. Later, its predictions are accurate in the real world because it has approximated the pattern behind the inputs and outputs of the real world.

Now, if the neural network has been trained on data that is itself guesswork, I can’t see how the overall DQN algorithm works. (I know it works, as demonstrated in the Jupyter Notebook of the course; I just can’t see how.)

However, I’m researching on it and I’d appreciate if you could shed light on it also.

This is an extremely huge data set. You would have to simulate every possible trajectory from every possible point, using every possible action. One might say it’s not a practical method.

Training a Q network to do the same job is more efficient.

The simulation only knows the physics, but DOES NOT know how to operate a lunar lander.

How can the lunar lander know what actions are needed in order to fly a straight line towards the landing pad, so that it can generate sequences of SARS tuples for those straight paths?

(If you need a Q Network in your answer, then let’s assume the Q Network is just initialized randomly. Also, until you agree that a Target Q Network is necessary, I expect you to not mention it at all.)

Raymond

PS: The above question does not imply that I agree with you that a straight path is an optimal path. A straight path may be the shortest path, but who says optimal = shortest? Optimality is determined by the highest reward, not the shortest distance. From the list of rewards you shared in your first post, there is no sign that the shortest distance gives the highest reward. I will not go into this now. Now I only want to focus on the above question. But you can change your position any time, just please give a new and complete story.

@rmwkwok

I don’t have any other alternative for getting target-Y apart from using the simulation. If not a straight line, we may have to go through thousands and thousands of sequences of human-controlled landings (like playing a video simulation game) to generate the SARS sequences and get y-targets. I am sure that may not be practical.

However, how a Target Q-Network solves the problem is not clear to me.

One difference is that all of your resources are going into training the DQN, rather than 1) creating a gigantic library of synthetic (or recorded) training examples, and then 2) running the examples through a complex training algorithm for sufficient iterations to get an optimized solution.

Such a training set would be impractically large and expensive.

This is the situation for which the DQN was invented.

That’s great! Recognizing the limitation of one approach is a key step to moving on to the next. I had been waiting for you.

Since the first time I read your approach and then your subsequent replies, it kept reminding me of the Mars Rover example presented in the lecture. In that example, we never needed any Target Q Network, or Target Q Anything. We never. And that is because the problem is so simple that we can enumerate all possible actions and results, then fill in the Q values according to the Bellman equation.

However, things become different in the Lunar Lander example: it is two-dimensional (Mars Rover is 1D); it concerns energy-saving (thrusting reduces rewards); it has infinitely many states (versus only 6 states for the Mars Rover); and we cannot accurately reach a particular next state because the thrusting force is not adjustable.

I had been thinking that your “straight line approach” represented an attempt to simplify the problem into a 1D problem. I think it was a good try, so I discussed it with you seriously. I would like both of us to find out to what extent your approach was practical. Recognizing the limitations - for both you and me - has been my primary objective.

In fact, things can easily become counter-intuitive. The famous Brachistochrone curve shows us that the shortest path is not the fastest when time is part of the objective.

Our Lunar Lander example is way more complicated than just considering time: it also rewards less thrusting and a soft landing, and, not to mention, in principle you can add more to the list of rewarding criteria, including time. Will we always be able to tell what an optimal path should look like, in order to simplify the problem into a 1D one?

This, together with the fact that the assignment shows that reinforcement learning with a Target Q Network works without any presumed optimal path, should be very encouraging, because it relieves us from having to tell what an optimal path should look like and enables us to explore solutions under any list of reward criteria.

Yes. In the Mars Rover example, we have a very limited number of cases to try; but in the Lunar Lander example, we can have an unbounded number of cases to try, so I agree with you that it may not be practical.

Before we move on, I would like you to know that your approach, in my understanding, is not very different from the Mars Rover approach presented in the lecture, which is why I think it is not wrong. However, I would also like to make sure that both of us recognize that there are some limitations, and agree that, given that the assignment works, we should have some confidence in the Target Q Network.

I won’t give you a complete answer, because it is your business to find out an answer that is convincing to you. However, I can give some of my understanding and some suggestions:

1. In the assignment, we are not deviating from the Bellman equation, because we ARE taking R(s) + γ · max Q_target(s’, a’) as the target Q value. The only problem here is that the Target Q Network may be giving us some wrong values, especially during the initial stage when the Target Q Network’s parameters are merely randomly initialized.

2. However, while we don’t run away from the aforementioned key problem, we also shouldn’t forget that, every time we generate a target Q/Y value, we are, to the best of our knowledge, following the Bellman equation, meaning that we are picking the a’ that maximizes Q_target(s’, a’). It is imperfect and can be very wrong, but it is the best we have to our knowledge (encapsulated in the Target Q Network).

3. Then, the problem is: can we expect that, over the training process, the Target Q Network will converge to an almost correct network, if not a perfect one? Because, let’s be pragmatic: we can’t expect to find the True Q Network in any way, just as we have never heard of any practical machine learning being 100% perfect. The best we can get is some useful, workable Target Q Network.

4. So, can we expect that it will converge to a useful Target Q Network? That’s the thing you might start googling. “Target Q Network”, “converge”, “gamma” - these are some keywords for the search, but you may need more to steer your research. I am sure you can find many relevant discussions and articles, since you are not the first one to wonder about this.
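For reference while researching: one mechanism that helps the Target Q Network stabilize is updating it only slowly toward the main Q-Network. A sketch of such a “soft update”, with plain numpy arrays standing in for network weights and an illustrative TAU value:

```python
import numpy as np

# Sketch of a 'soft update', one mechanism behind Target Q Network
# stability: the target network's weights move only a small step toward
# the main Q-Network's weights after each training step. TAU and the
# plain numpy arrays standing in for network weights are illustrative.

TAU = 0.001  # small step size: the target network changes slowly

def soft_update(q_weights, target_weights, tau=TAU):
    """target <- tau * q + (1 - tau) * target, applied weight by weight."""
    return [tau * w + (1.0 - tau) * tw
            for w, tw in zip(q_weights, target_weights)]

q_w = [np.ones((2, 2))]        # main Q-Network weights (stand-in)
t_w = [np.zeros((2, 2))]       # Target Q-Network weights (stand-in)
new_t_w = soft_update(q_w, t_w)
print(new_t_w[0][0, 0])  # 0.001: each entry moved slightly toward 1.0
```

Because the targets are produced by this slowly-moving network rather than the rapidly-changing main network, the ‘y’ values drift gradually instead of chasing themselves, which is part of why the training can converge.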

That is all I can share with you. Please consider the search work your only way to find your answer, and, if you don’t mind, feel free to share your new understandings and findings so that we can discuss them. For anything other than your findings, including questions, I may not have anything more to respond to.

If you can’t find anything now, then maybe you need to come back to it in the future, when you are more experienced.

Good luck, and I look forward to your findings.

Cheers,
Raymond

Thanks for the responses.

I’ll keep researching and share my findings in due course of time.

Sure! There is no need to rush a response. Learning is not like taking a time-limited test, and everyone has their own pace. I would like to see some critical thinking in a thought-through discussion! I can wait.

Cheers,
Raymond