# Confusion regarding basic mathematics of DQN Algorithm

In reinforcement learning, the return of a state-action pair is calculated using the Bellman equation:

Q(s, a) = R(s) + γ · max_a' Q(s', a')

The above expression tells us that to calculate the return ‘Q’ for a state-action pair, we have to calculate two parts, i.e., the ‘reward you get right away’ and the ‘return from behaving optimally afterwards’.

The first part, i.e., the ‘reward you get right away’, can be calculated from the reward function shown below:

The ‘reward you get right away’ is the reward of the state that you are presently in. Since we have an arbitrary state, we can check from the lunar lander simulation whether we have a leg grounded, whether we have crashed, or whether we have landed, and thus we can compare our arbitrary state against the above reward function to get ‘R(s)’, i.e., the ‘reward you get right away’.
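As a sketch of this first part, reading off R(s) from the current state might look like the following. The flags and numeric values here are illustrative placeholders, not the Lunar Lander environment’s actual reward function:

```python
# Hypothetical sketch of reading off R(s), the 'reward you get right away',
# from the current state. The flags and numeric values are illustrative
# placeholders, not the Lunar Lander environment's actual reward function.

def immediate_reward(state):
    """Return R(s) for the situation the lander is presently in."""
    if state.get("crashed"):
        return -100.0   # illustrative crash penalty
    if state.get("landed"):
        return 100.0    # illustrative landing bonus
    if state.get("leg_grounded"):
        return 10.0     # illustrative leg-contact bonus
    return 0.0          # nothing notable happened at this step

print(immediate_reward({"leg_grounded": True}))  # 10.0
```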

As for the second part i.e., ‘return from behaving optimally afterwards’, we don’t have a defined policy. So, how would we calculate the second part without a policy?

To compute it in the above way, we need to know the values of Q for each pair of s and a in advance. The lecture’s simple Mars Rover example should have demonstrated that.

In the more complicated Lunar Lander example, Q is not known in advance; in fact, Q is the very thing we need to solve for. The steps should be:

1. Learn a Q(s, a) function given the Reward function (that you showed) and the Bellman equation (that you also showed)

2. After the learning, compute Q(s, a) right away, without the Bellman equation and without breaking it into two steps.

For how to learn the Q function, check out this lecture and the assignment.
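As an illustration of step 2, after training, the learned Q function is used directly for action selection. The Q values below are made up and stand in for a trained network’s outputs for one state:

```python
import numpy as np

# Sketch of step 2 above: after training, Q(s, a) is used directly, with no
# Bellman decomposition and no two-part calculation. The Q values below are
# made up and stand in for a trained network's outputs for one state.

def greedy_action(q_values):
    """Pick the action with the highest estimated return Q(s, a)."""
    return int(np.argmax(q_values))

q_of_s = np.array([1.2, 3.4, 0.5, 2.0])  # one Q value per action
print(greedy_action(q_of_s))  # 1  (the second action has the highest Q)
```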

Cheers,
Raymond

However, you explained how the ‘Q-Network’ in the lectures and the respective notebook is trained.

My question was based on the ‘Target Q-Network’ in the final Jupyter Notebook of the reinforcement learning week 3 (Notebook Name is “C3_W3_A1_Assignment”).

I don’t understand why a ‘Target Q-Network’ is needed from a conceptual standpoint.

Every neural network ‘estimates’ a value. In the above-referenced notebook, we are using a ‘Target Q-Network’ to output Q(s,a), and this Q(s,a) is being treated as the true-Y values.

This true-Y value will be used to train the main Q-Network that will predict Q(s,a) for given state-action pair.

Thus, it is actually our true-Y values that play the central role in training the main Q-Network, which will ultimately be used for predicting Q(s,a) without directly using the Bellman equation.

My confusion is regarding using a ‘Target Q-Network’ to output true-Y values. True values are not supposed to be created; they’re supposed to come from real-world data. In our case, the simulation of the moon lander used in the notebook is our alternative to real-world data.

So, why are we using a neural network to populate true-Y values in the training set?

Keeping above arguments in view, the overall reinforcement learning algorithm in the notebook is working in the following way:

1. We use the moon lander simulation to get SARS tuples. This is an alternative to real-world data. From these SARS tuples, we’ve got our ‘x’. It is a vector of state-action pairs, i.e., x = (s,a). However, the training set is incomplete without a ‘y’.

2. We create a ‘Target Q-Network’ to generate ‘y’. This means our ‘y’ is an estimate and not actually a true value. (An actual true value of ‘y’ would require the return from behaving optimally starting from state s’, which is not possible without having a policy.)

3. We create a ‘Q-Network’ to estimate the true y-values generated above, i.e., the Q-Network is effectively ‘estimating the estimated value of y’.

Thus, I’m confused because we never used a true Q(s,a) to train our neural network, and so our Q-Network is trained to ‘estimate an estimate’.
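To make step 2 above concrete, the target-generation step can be sketched as follows. All names and numbers here are illustrative, and `next_q_values` stands in for the Target Q-Network’s outputs Q_target(s', a'):

```python
import numpy as np

# Sketch of step 2 above: the Target Q-Network generating 'y' for a
# minibatch of SARS tuples. All names and numbers are illustrative;
# next_q_values stands in for Q_target(s', a') evaluated by the network.

GAMMA = 0.995  # an assumed discount factor

def compute_targets(rewards, next_q_values, done_flags, gamma=GAMMA):
    """y = R + gamma * max_a' Q_target(s', a'); no bootstrap after episode end."""
    max_next_q = next_q_values.max(axis=1)
    return rewards + gamma * max_next_q * (1.0 - done_flags)

rewards = np.array([1.0, -100.0])
next_q_values = np.array([[0.5, 2.0],   # Q_target(s', a') for two actions
                          [3.0, 1.0]])
done_flags = np.array([0.0, 1.0])       # second transition ended the episode
targets = compute_targets(rewards, next_q_values, done_flags)
print(targets)  # first target: 1.0 + 0.995 * 2.0 = 2.99; second: -100.0
```

These ‘y’ values are then used as the regression targets when fitting the main Q-Network, which is exactly the “estimate of an estimate” being questioned here.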

Before going any further, I have a question for you: assuming that we don’t have the Target Q Network, how can we get the true target Y values [that we use for successfully training a useful Q Network]?

I am not expecting an answer like “return from behaving optimally starting from …”, because I cannot convert that sentence into my true target Y values [that we use for successfully training a useful Q Network].

I am asking, step-by-step, how can we get the true target Y values [that we use for successfully training a useful Q Network]?

If you think there is no way, because we do not know the real true Y values in advance, then it is fine for you to say there is no way. I do not know either.

If you want to do some research on why a Target Q Network can provide us some true target Y values that we use for successfully training a useful Q Network, then I would be very happy to see what you find. Learning does not stop with a question, right?

So, two things:

1. How can we get any true target Y values without the Target Q Network? And does it comply with the condition - “return from behaving optimally starting from …” - that you are requiring for the Target Q Network? However, it’s fine to say that there is no such way.

2. Any findings in your research?

Cheers,
Raymond

PS1: I keep using “that we use for successfully training a useful Q Network” to post-modify the term “true target Y values” because we know that the Target Q Network is changing too. However, how can the truth be changing? Therefore, we know that we are not expecting the Target Q Network to produce the truth during the training process, but rather useful target Y values for doing gradient descent.

PS2: From this moment on, I will not use the term “true y values”, but “target y values”. I hope you will do the same.

We are using a simulation to generate the SARS experience tuples. This means that our simulation captures the physics of the moon lander. After all, the SARS tuple itself is generated from the simulation, because the simulation is able to tell how the position of the moon lander would change if we fire a thruster in any direction.

Therefore, using the simulation, we may generate a multitude of sequences of landing the moon lander successfully from a multitude of initial states, such that the moon lander follows a straight-line path from its initial state to the landing pad.

Since the moon lander follows a straight-line path to the landing pad from wherever it was initially in space, it can very well be considered as *behaving optimally starting from …*.

Following the above idea, a single optimal sequence would contain several SARS tuples. Moreover, at the end of each sequence, we’d be able to calculate not only the immediate reward but also the return for behaving optimally starting from a state s’.

Conclusively, we’d be able to calculate Q(s,a) from the Bellman equation with actual numbers obtained from the simulation runs for both parts of the equation, i.e., the R(s) part and the γ·max Q(s’,a’) part.

Now, instead of having 10,000 SARS tuples, we could have 10,000 optimal sequences as the training data for the useful Q-Network. That would eliminate the need for a Target Q-Network altogether.
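The proposal above, computing returns directly from completed sequences, can be sketched like this (a Monte Carlo style calculation; the rewards and discount factor are made up):

```python
# Sketch of the proposal above: given one complete landing sequence, the
# return from every state can be computed backwards from the final step,
# with no Target Q-Network involved. Rewards and gamma are made up.

GAMMA = 0.5  # small illustrative discount so the numbers are easy to check

def returns_from_sequence(rewards, gamma=GAMMA):
    """G_t = r_t + gamma * G_{t+1}, accumulated from the last step backwards."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# A three-step sequence whose only reward is a final landing bonus:
print(returns_from_sequence([0.0, 0.0, 100.0]))  # [25.0, 50.0, 100.0]
```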

I’m researching it, and actually this is exactly the main thing I’m trying to understand: how are we using a neural network to generate target Y values?

It is because my understanding of a neural network is that we give it real-world data and it fits a complex n-dimensional curve/graph/pattern to that data during the learning process. Later, its predictions are accurate in the real world because it has approximated the pattern behind the inputs and outputs of the real world.

Now, if the neural network has been trained on data that is itself guesswork, I can’t see how the overall DQN algorithm works. (I know it works, as demonstrated in the Jupyter Notebook of the course; I just can’t see how.)

However, I’m researching on it and I’d appreciate if you could shed light on it also.

This is an extremely huge data set. You would have to simulate every possible trajectory from every possible point, using every possible action. One might say it’s not a practical method.

Training a Q network to do the same job is more efficient.

The simulation only knows the physics, but DOES NOT know how to operate a lunar lander.

How can the lunar lander know what actions are needed in order to fly a straight line towards the landing pad, so that it can generate sequences of SARS tuples for those straight paths?

(If you need a Q Network in your answer, then let’s assume the Q Network is just initialized randomly. Also, until you agree that a Target Q Network is necessary, I expect you to not mention it at all.)

Raymond

PS: The above question does not imply that I agree with you that a straight path is an optimal path. A straight path may be the shortest path, but who says optimal = shortest? Optimality is determined by the highest reward, not the shortest distance. From the list of rewards you shared in your first post, there is no sign that the shortest distance gives the highest reward. I will not go into this now. Now I only want to focus on the above question. But you can change your position any time, just please give a new and complete story.

@rmwkwok

I don’t have any other alternative for getting target-Y apart from using the simulation. If not a straight line, we may have to go through thousands and thousands of sequences of human-controlled landings (like playing a video simulation game) to generate the SARS sequences and get y-targets. I am sure that may not be practical.

However, how a Target Q-Network solves the problem is not clear to me.

One difference is that all of your resources are going into training the DQN, rather than 1) creating a gigantic library of synthetic (or recorded) training examples, and then 2) running the examples through a complex training algorithm for sufficient iterations to get an optimized solution.

Such a training set would be impractically large and expensive.

This is the situation for which the DQN was invented.

That’s great! Recognizing the limitation of one approach is a key step to moving on to the next. I had been waiting for you.

Since the first time I read your approach and then your subsequent replies, it kept reminding me of the Mars Rover example presented in the lecture. In that example, we never needed any Target Q Network, or Target Q Anything. We never. And that is because the problem is so simple that we can enumerate all possible actions and results, then fill in the Q values according to the Bellman equation.

However, things become different in the Lunar Lander example: it is two-dimensional (Mars Rover is 1D); it concerns energy-saving (thrusting reduces rewards); it has infinitely many states (versus only 6 states for the Mars Rover); and we cannot accurately reach a particular next state because the thrusting force is not adjustable.

I had been thinking that your “straight line approach” represented an attempt to simplify the problem into a 1D problem. I think it was a good try, so I discussed it with you seriously. I would like both of us to find out to what extent your approach was practical. Recognizing the limitations - for both you and me - has been my primary objective.

In fact, things can easily become counter-intuitive. The famous Brachistochrone curve shows us that the shortest path is not the fastest when time is part of the objective.

Our Lunar Lander example is way more complicated than just considering time: it also rewards less thrusting and a soft landing, and, not to mention, in principle you can add more to the list of rewarding criteria, including time. Will we always be able to tell what an optimal path should look like, in order to simplify the problem into a 1D one?

This, together with the fact that the assignment shows that reinforcement learning with a Target Q Network works without any presumed optimal path, should be very encouraging, because it relieves us from having to tell what an optimal path should look like and enables us to explore solutions under any list of reward criteria.

Yes. In the Mars Rover example, we have a very limited number of cases to try; but in the Lunar Lander example, we can have an unbounded number of cases to try, so I agree with you that it may not be practical.

Before we move on, I would like you to know that your approach, in my understanding, is not very different from the Mars Rover approach presented in the lecture, which is why I think it is not wrong. However, I would also like to make sure that both of us recognize that there are some limitations, and agree that, given that the assignment works, we should have some confidence in the Target Q Network.

I won’t give you a complete answer, because it is your business to find out an answer that is convincing to you. However, I can give some of my understanding and some suggestions:

1. In the assignment, we are not deviating from the Bellman equation, because we ARE taking R(s) + γ · max Q_target(s’, a’) as the target Q value. The only problem here is that the Target Q Network may be giving us some wrong values, especially during the initial stage when the Target Q Network’s parameters are merely randomly initialized.

2. However, while we don’t run away from the aforementioned key problem, we also shouldn’t forget that, every time we generate a target Q/Y value, we are, to the best of our knowledge, following the Bellman equation, meaning that we are picking the a’ that maximizes Q_target(s’, a’). It is imperfect and can be very wrong, but it is the best we have to our knowledge (encapsulated in the Target Q Network).

3. Then, the problem is: can we expect that, over the training process, the Target Q Network will converge to an almost correct network, if not a perfect one? Because, let’s be pragmatic: we can’t expect to find the True Q Network in any way, just as we have never heard of any practical machine learning being 100% perfect. The best we can get is some useful, workable Target Q Network.

4. So, can we expect that it will converge to a useful Target Q Network? That’s the thing you might start googling. “Target Q Network”, “converge”, “gamma” - these are some keywords for the search, but you may need more to steer your research. I am sure you can find many relevant discussions and articles, since you are not the first one to wonder about this.
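For reference while researching: one mechanism that helps the Target Q Network stabilize is updating it only slowly toward the main Q-Network. A sketch of such a “soft update”, with plain numpy arrays standing in for network weights and an illustrative TAU value:

```python
import numpy as np

# Sketch of a 'soft update', one mechanism behind Target Q Network
# stability: the target network's weights move only a small step toward
# the main Q-Network's weights after each training step. TAU and the
# plain numpy arrays standing in for network weights are illustrative.

TAU = 0.001  # small step size: the target network changes slowly

def soft_update(q_weights, target_weights, tau=TAU):
    """target <- tau * q + (1 - tau) * target, applied weight by weight."""
    return [tau * w + (1.0 - tau) * tw
            for w, tw in zip(q_weights, target_weights)]

q_w = [np.ones((2, 2))]        # main Q-Network weights (stand-in)
t_w = [np.zeros((2, 2))]       # Target Q-Network weights (stand-in)
new_t_w = soft_update(q_w, t_w)
print(new_t_w[0][0, 0])  # 0.001: each entry moved slightly toward 1.0
```

Because the targets are produced by this slowly-moving network rather than the rapidly-changing main network, the ‘y’ values drift gradually instead of chasing themselves, which is part of why the training can converge.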

That is all I can share with you. Please consider the search work your only way to find your answer, and, if you don’t mind, feel free to share your new understandings and findings so that we can discuss them. For anything other than your findings, including questions, I may not have anything more to respond to.

If you can’t find anything now, then maybe you need to come back to it in the future, when you are more experienced.

Good luck, and I look forward to your findings.

Cheers,
Raymond

Thanks for the responses.

I’ll keep researching and share my findings in due course of time.

Sure! There is no need to rush a response. Learning is not like taking a time-limited test, and everyone has their own pace. I would like to see some critical thinking in a thought-through discussion! I can wait.

Cheers,
Raymond