Can someone tell me whether my intuition is correct?
What I think I learned from this video is that we will create two neural networks:
i) The first one will calculate all possible values of Q(s’(a), a) from the values of Q(s(a), a). I.e., if the lander is at some position (x, y) with some orientation and we decide to fire one or more thrusters as our action, it will calculate the new state and the reward we get for that action and store it as Y. We will generate many tuples like this and store 10,000 such examples.
ii) We will then give these output tuples as input to the second neural network, which will train itself using the Bellman equation so that it learns to take the best action in whatever situation it is in. I.e., whenever it is at some position (x, y) with any orientation, it will be able to take the best action, because it has seen many examples of being at that position and knows which action is best.

If my intuition is correct (which I suspect it isn’t), why are we using the first neural network at all, when we could calculate these things with ordinary programming techniques such as running a loop?

I have read the previous discussion on a similar topic about the DQN network, but I wasn’t able to follow that discussion either. I have watched the video several times now. Please help.

Probably the first one is the objective function that will give you the state and reward.
In Lunar Lander this is done mathematically, but in the real world, where they collect petabytes of data, training a neural network for the objective function makes sense, since it maps the collected data to the space and the objective function. Look at game-playing algorithms and you will get a clear idea about space, steps and actions.

Are there two neural networks? Or are we calculating all possible steps in the first step using some loop? I don’t understand how this step is done.

Where can I see this? Is it a video? Can you tell me more about it?

For DQN, there is only one neural network, but we might have 2 “copies” of it during training.

There is no neural network that calculates the initial dataset (of 10,000 in your example). The initial dataset is generated using a simple loop.

Inside this loop, you run inference on that one neural network to generate the samples.
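To make the "simple loop" concrete, here is a minimal sketch of what such a collection loop might look like. Everything in it is a stand-in rather than the course's actual code: `q_values` is a hypothetical placeholder for the one neural network, `toy_step` is a made-up one-dimensional simulator instead of the lunar lander, and the action encoding, the `epsilon` value and the extra `done` flag in each tuple are assumptions.

```python
import random
from collections import deque

ACTIONS = [0, 1, 2, 3]  # assumed encoding: nothing, left, main, right thruster


def q_values(state):
    """Hypothetical stand-in for the Q-network: one score per action."""
    return [0.0 for _ in ACTIONS]


def toy_step(state, action):
    """Made-up simulator: returns (reward, next_state, done)."""
    next_state = state + (1 if action == 2 else -1)
    done = next_state <= 0 or next_state >= 10
    reward = 100.0 if next_state >= 10 else -1.0
    return reward, next_state, done


def collect(num_steps, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    buffer = deque(maxlen=10_000)  # oldest experiences drop out automatically
    state = 5
    for _ in range(num_steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)      # explore: random action
        else:
            q = q_values(state)               # inference on the single network
            action = q.index(max(q))          # exploit: best action so far
        reward, next_state, done = toy_step(state, action)
        buffer.append((state, action, reward, next_state, done))
        state = 5 if done else next_state     # restart the episode when it ends
    return buffer
```

Calling `collect(12_000)` would leave the 10,000 most recent experience tuples in the buffer. Note that nothing here is trained; the network is only asked which action currently looks best.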

This “second” neural network is the DQN. It takes the generated input and trains on it. One thing to note is that we keep a “copy” of this neural network (the target network) during training. This copy has the exact same structure and starts with the same params. The gradient updates are applied to the original network, and the copy is only brought back in line with it every certain number of steps, which keeps the training targets stable and helps the training converge.
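As a rough illustration of keeping such a copy, here is a sketch in which the network weights are just a dictionary of lists. The `sync_target` helper, the `tau` parameter and the layer names are hypothetical; the scheme shown (refreshing the copy from the trained network, either fully or gradually) is one common way this is done and may differ in detail from the course code.

```python
import copy


def sync_target(online, target, tau=1.0):
    """Move the target weights towards the online weights.

    tau=1.0 is a full copy; a small tau gives a gradual "soft" update.
    """
    for name, weights in online.items():
        target[name] = [
            tau * o + (1.0 - tau) * t
            for o, t in zip(weights, target[name])
        ]


online = {"layer1": [0.5, -0.2], "layer2": [1.0]}
target = copy.deepcopy(online)   # same structure and params at the start

online["layer1"] = [0.9, 0.2]    # pretend some gradient steps changed the net
# target["layer1"] is still [0.5, -0.2] here: the copy lags behind on purpose

sync_target(online, target, tau=1.0)  # periodic refresh of the copy
```

Because the copy only changes at these refresh points, the values the network is trained towards do not shift after every single gradient step.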

Your first intuition was incorrect. Your second intuition is correct: we do indeed calculate the input using a loop for DQN.

Aside: there are systems that utilize more than one neural network, just not DQN. For example, Generative Adversarial Networks (GANs) do something similar to what you describe (i.e. one neural network generates samples to train a second neural network, and they go back and forth training each other until they converge).

So we can have multiple sequences of tasks that we perform after one task. E.g., we fire one thruster and crash. Then we start again, fire the first and second thrusters, and crash. Then we start again and do the same thing, but with a new action. I am unable to wrap my head around how we can string together all the possible tasks we could take after doing one task. Once we have fired one of the thrusters, there are still 4 thrusters we can fire after that, and then 4 more again. Just this single sequence can generate 16 different sets of steps. I am confused on this part…

Again, did you mean the second neural network here, i.e. the DQN?

Yes, that is correct: the space of possible combinations is massive, so it is hard to imagine it working. DQN does not attempt to try out all those combinations; it just slowly comes to prefer the states/actions that eventually lead to higher rewards (and that preference propagates backwards in time).
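To put a number on "massive", here is a tiny sketch that counts the distinct action sequences for a given number of steps, assuming 4 possible actions per step as in the question. The function name is made up for illustration.

```python
def num_action_sequences(num_actions, horizon):
    """Number of distinct action sequences of length `horizon`."""
    return num_actions ** horizon


for horizon in (2, 10, 100):
    print(horizon, num_action_sequences(4, horizon))
```

Two steps give the 16 sequences mentioned in the question, and ten steps already give more than a million, which is why enumerating them all is not an option.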

I think it’s better to think of the DQN training process as slowly preferring the steps that eventually lead to a higher reward in the end. Say you have this chain of events that eventually leads to a reward:

a → b → c → d → e → reward

In the beginning, the training would really only start affecting the Q value output for the later states, like e and d. The model would have a slight preference for the e and d states. It doesn’t have a preference for a, b or c states, so the Q value output for those would still be random.

As you train it more, it would start to prefer states that lead to states e and d, so it would start preferring c and assigning a higher Q value to it.
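This backwards spread can be seen in a much simpler setting than a neural network. The sketch below uses a plain lookup table over five states named a to e, with a single action per state and a reward of 1 arriving only after e. The `GAMMA` and `ALPHA` values are arbitrary assumptions, and the values start at zero rather than at random, so it only illustrates the direction in which the preference travels.

```python
STATES = ["a", "b", "c", "d", "e"]
GAMMA = 0.9   # assumed discount factor
ALPHA = 0.5   # assumed learning rate


def sweep(q):
    """One pass over the chain, updating each state from its successor."""
    for i, s in enumerate(STATES):
        if i == len(STATES) - 1:
            target = 1.0                        # the reward arrives after e
        else:
            target = GAMMA * q[STATES[i + 1]]   # no reward yet, only future value
        q[s] += ALPHA * (target - q[s])


q = {s: 0.0 for s in STATES}
history = []
for _ in range(3):
    sweep(q)
    history.append(dict(q))   # snapshot of the table after each pass
```

After the first pass only e has a non-zero value, after the second pass d has one too, and after the third pass c joins them, while a and b are still at zero.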

When talking about generating inputs, it’s better to think about there only being a single network. I don’t know what you mean by “second” neural network here.

Note that the generated inputs don’t actually include the estimated Q value; they are only (state, reward, action, next_state). The neural network is only used to populate the generated inputs with more of the actions that currently seem favorable.

The “copied” target network really only applies during the training step.
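Here is a minimal sketch of where the target network might enter the training step, using the Bellman equation to turn stored tuples into training targets. `target_q` and `online_q` are hypothetical callables that return one Q value per action for a state, the tuple layout `(state, action, reward, next_state, done)` with its `done` flag is an assumption, and `GAMMA` is an arbitrary discount factor.

```python
GAMMA = 0.99  # assumed discount factor


def td_targets(batch, target_q):
    """y = r + GAMMA * max over a' of Q_target(s', a'), or just r at episode end."""
    ys = []
    for state, action, reward, next_state, done in batch:
        if done:
            ys.append(reward)
        else:
            ys.append(reward + GAMMA * max(target_q(next_state)))
    return ys


def squared_errors(batch, online_q, ys):
    """Gap between the trained network's Q(s, a) and the target y per sample."""
    return [
        (online_q(s)[a] - y) ** 2
        for (s, a, _, _, _), y in zip(batch, ys)
    ]
```

The targets come from the copy, while the errors are what the trained network is adjusted to reduce, so the Q values never need to be stored alongside the generated samples.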

And we are giving (s, a) as input, calculating Q(s, a), and seeing how close it is to the reward Y, so that we can mark each step we have taken as good or bad…