Some doubts about gathering data for reinforcement learning

I have some doubts about the week 3 video on the state value function. In the video, Andrew said that
we can get our data for the NN like this:

I don’t get the y^{(1)} = R(s^{(1)}) + \gamma \max_{a'} Q(s'^{(1)}, a') part. On the next slide, it tells me to train Q multiple times and set Q = Qnew. How can we do that? Here’s what I am thinking:

I think this is the only way to gather our data: we start at some position (x, y) at the root node and try every possible action to get the next position and state. We keep trying everything like that until we reach a final result (a crash or a successful landing). Now we have the value of the leaf node, so we trace back to get the value of each parent node, and that gives us the data (x1, y1) from the lecture. Am I correct?
One more question: if we already have all the data we need, why do we need to train multiple times (Q = Qnew)? Is it supposed to make anything better?
Thank you for reading.

Hello @cpp219

To begin with, it’s impossible for us to exhaustively go through all possible state-action pairs, because we are dealing with a continuous state space. We only gather as much as we can. We keep gathering data until the lander crashes, then we restart and keep gathering data until it crashes again, and we repeat this over and over. We do not make sure we have gone through every possible state-action pair, because we simply can’t.

Your graph is not exhaustive either, so let’s say we have what you have drawn, and I have added something to it to help me explain:

Step 1, we pick a SARS (state, action, reward, next state) sequence.

Step 2, we mark the starting state of this sequence as s^{(1)}

Step 3, we mark the action taken as a^{(1)}

Step 4, we mark the reward we have got from the state as R(s^{(1)})

Step 5, we mark the consequent state as s'^{(1)}

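Step 6, given s'^{(1)}, we go to our Q-network, plug s'^{(1)} in as the input, and then select the maximum Q-value over the possible next actions a', that is \max_{a'} Q(s'^{(1)}, a')

Step 7, we compute y^{(1)} = R(s^{(1)}) + \gamma \max_{a'} Q(s'^{(1)}, a'). Now we have the (x^{(1)}, y^{(1)}), where x^{(1)} is the pair (s^{(1)}, a^{(1)}).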

Step 8, we go back to step 1, take another SARS sequence, and do steps 2-7 again. From your graph, we can get a total of 16 samples because there are 16 SARS sequences.
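Steps 1 to 8 can be sketched in code roughly like this. It is only a minimal sketch, not the assignment’s code: `q_network` is a hypothetical linear stand-in for the real neural network, and the discount value, the one-hot action encoding, and the `done` flag (at a terminal state the target is just the reward) are my assumptions.

```python
import numpy as np

GAMMA = 0.995  # discount factor; the exact value is an assumption


def q_network(state, weights):
    """Hypothetical stand-in for the Q-network: one Q-value per action."""
    return state @ weights


def build_training_set(transitions, weights, gamma=GAMMA):
    """Turn SARS tuples into (x, y) pairs using the current Q estimates."""
    n_actions = weights.shape[1]
    xs, ys = [], []
    for s, a, r, s_next, done in transitions:
        # Step 6: plug s' into the current Q-network and take the max over a'.
        max_q_next = 0.0 if done else float(np.max(q_network(s_next, weights)))
        # Step 7: y = R(s) + gamma * max_a' Q(s', a').
        ys.append(r + gamma * max_q_next)
        # x is the state concatenated with a one-hot encoding of the action.
        xs.append(np.concatenate([s, np.eye(n_actions)[a]]))
    return np.array(xs), np.array(ys)


# Usage: two made-up transitions with a 2-number state and 2 actions.
weights = np.array([[1.0, 0.0], [0.0, 2.0]])
transitions = [
    (np.array([0.0, 1.0]), 1, 1.0, np.array([1.0, 1.0]), False),
    (np.array([1.0, 1.0]), 0, -5.0, np.array([2.0, 2.0]), True),
]
xs, ys = build_training_set(transitions, weights, gamma=0.5)
```

Note that the targets y come from whatever the Q-network currently predicts, so they change as the network is retrained.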


I still don’t get what you mean.
In step 6, we go to our Q-network and plug s'^{(1)} in as the input, so do we get the data at the same time as we’re training the neural network? I thought we could only start training the neural network once we have all the data we need, because we can’t evaluate the loss without correct data, can we?
Furthermore, isn’t R(s) always 0 when s is not a final state (landing or crash)? R is the immediate reward, right?


Yes too. But “all the data we need” does not mean we have exhaustively gone through every state-action pair. It only means sufficient data. Let’s say we want to train our model with 1000 SARS sequences; then your “all data we need” means 1000 SARS sequences.

@cpp219, we are actually using the Q-network before it has been well trained. In other words, we are already using it while it’s still premature. This is how reinforcement learning differs from the usual supervised learning, where we have a full dataset in advance, train the model, and then make predictions.

Reinforcement learning is a process in which we allow the model to learn along the way while we use it (to make predictions from it). When the Q-network is randomly initialized, we use it and allow it to learn. Once the Q-network is somewhat better after some learning, we continue to use it and continue to let it learn.

Learning while using lets it learn through interacting with the environment. Such an arrangement lets it learn where it goes wrong and make improvements.

What we hope for is that the Q-network will eventually converge to a model that can handle all situations and land the lunar lander safely. It gets there by going through many situations while it is still not well trained.

We don’t wait until the Q-network is perfect before using it or making predictions from it. Therefore, we don’t exhaustively collect all possible state-action pairs to train a perfect Q-network first before using it. This is reinforcement learning.
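To make the “use it while it learns” idea concrete, here is a toy sketch. It is not the lunar lander: it assumes a hypothetical 1-D chain of 5 states with the goal at the right end, a table instead of a neural network, and made-up values for the learning rate, discount, and exploration schedule. The point is only that the same Q estimates are used to pick actions and to build targets while they are still being updated.

```python
import numpy as np


def run_episode(q_table, rng, epsilon, gamma=0.9, lr=0.5, max_steps=50):
    """One episode on a hypothetical chain: start at state 0, goal at the last state."""
    n_states = q_table.shape[0]
    s = 0
    for _ in range(max_steps):
        # Use the current (possibly poor) Q estimates to choose an action,
        # with some random exploration. Action 0 = left, action 1 = right.
        if rng.random() < epsilon:
            a = int(rng.integers(2))
        else:
            a = int(np.argmax(q_table[s]))
        s_next = max(s - 1, 0) if a == 0 else s + 1
        done = s_next == n_states - 1
        r = 1.0 if done else 0.0
        # The target is built from the same estimates that are still being learned.
        target = r if done else r + gamma * np.max(q_table[s_next])
        q_table[s, a] += lr * (target - q_table[s, a])
        s = s_next
        if done:
            break
    return q_table


rng = np.random.default_rng(0)
q_table = np.zeros((5, 2))  # starts out knowing nothing
for episode in range(200):
    # Explore a lot at first, then rely more on what has been learned.
    epsilon = max(0.1, 1.0 - episode / 100)
    q_table = run_episode(q_table, rng, epsilon)

greedy_actions = [int(np.argmax(q_table[s])) for s in range(4)]
```

After training, `greedy_actions` lists the preferred action in each non-goal state, even though the table was used to act from the very first episode, when it was all zeros.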

No, R(s) is not always 0 for non-final states. It depends on the environment. The lunar lander’s environment is one in which making progress (getting closer to the landing pad) is rewarded. In a real situation, maybe only a good landing is considered successful, but when we are training a model, we want to create an environment that rewards correct intermediate decisions, because those decisions are prerequisites to a successful landing.

So, we create an environment that guides the Q-network toward good decisions by rewarding it along the way to a successful landing.
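As a rough illustration of the difference, here is a sketch of a reward given only at the end versus one that also rewards progress. These functions and numbers are hypothetical and are not the actual reward formula of the assignment’s environment; the pad position, the scale, and the ±100 terminal values are assumptions.

```python
import math


def sparse_reward(landed, crashed):
    """Reward only at the final state: zero everywhere else."""
    if landed:
        return 100.0
    if crashed:
        return -100.0
    return 0.0


def shaped_reward(prev_pos, pos, landed, crashed, pad=(0.0, 0.0), scale=10.0):
    """Also reward each step that moves the lander closer to the pad."""
    progress = math.dist(prev_pos, pad) - math.dist(pos, pad)
    return scale * progress + sparse_reward(landed, crashed)


# Moving from height 2 to height 1 above the pad, not yet landed:
step_sparse = sparse_reward(False, False)
step_shaped = shaped_reward((0.0, 2.0), (0.0, 1.0), False, False)
```

With the sparse version the step above gives no signal at all, while the shaped version gives a positive reward for getting closer and a negative one for drifting away.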


PS: In the assignment of course 3 week 3, you can observe all I have said above. It is the lunar lander assignment.


It now makes sense to me, thank you a lot for your help.
So in conclusion, R(s) is not always 0, so we can start at some random state, take actions, gather the data (using the premature Q-network), and retrain the Q-network on the data we just collected (it will figure out how to behave correctly because we give it some non-zero immediate rewards), right?

@cpp219, yes, that is the idea!

