How does the Q-Learning Algorithm actually learn?

Here’s the algorithm from lecture:


So we are initializing Q to be some random function. Then we generate 10000 examples of x = (s,a). That’s fine. It’s the next step that’s absolutely crazy to me. We generate y = R(s) + ymax(Q(s’, a’)). In other words, we are using Q, which is at this point a completely random function, to generate the target data. Why is this target data not just complete junk? It’s generated using a random function. Then, we train Q_new on this data such that Q_new(s, a) ~= y. How is Q_new supposed to be any closer to the ideal Q* than Q after training it on junk data?

What is forcing Q to improve towards the ideal, Q*? I don’t get it.

1 Like

Hello @James_Goff,

We need to note that the so called Q function is all about representing the Reward, and Reward is what is being injected into the Q function over the training steps. The randomly initialization of Q shouldn’t be the only focus.

To convince yourself, I suggest you to implement the algorithm. This is an example from the lecture “State-action value function definition”:


where the red numbers are the optimal Q(s, a).

You might setup the problem with

Q = np.random.normal(size=(6, 2))
R = np.array([100., 0, 0, 0, 0, 40])
A = np.array([-1, 1])
gamma = 0.5

Note that Q is also randomly initialized. Then write a loop to do two things at each step: (1) sample one (s, a, R, s’) (2) update Q with that formula in the algorithm if s is not a terminal state, or update Q with just R if it is at the terminal state.

It should converge to the optimal Q within 100 iterations. Since it is a handle-able number of iterations, you can actually print the steps out and examine them. If you initialize Q to zeros instead, you can also see how the reward “spread out” from the terminal state into the middle states.

Good luck.


Couple questions:

  • Why are you intializing Q to a number? Isn’t Q a function (in this case a neural network)?
  • How to sample (s, a R, s’) in this example?

The purpose of the neural network is to accept an input and to produce an output. Making it an array serves the same purpose. An array accepts state and action as input (index values), and produces an output (array element).

This example has a very very small state action space (of 12 state action combinations only), so a neural network isn’t necessary at all.

The array form provides a simple approach for us to focus just on the evolution of Q, instead of all those details of training a network. However, after studying thoroughly with the array form of Q, you might look into how to switch it to a neural network yourself (or just do the week’s assignment). We will focus on the array form here.

For s, you might random sample it every time.
For a, you might consider yourself adopting the epsilon greedy algorithm (taught in lecture) with epsilon being set to 1.
For R and s’, what should they be? I think you can figure them out if you know what they mean?

Hey, did you end up understanding this? because I still cannot quite get how it learns. It isn’t making sense to me.
I also tried searching the web, but each source has a different way of implementingb the algorithm, so the more I searched the more I got confused.

Hello @nasseralbess,

Share what you have understood so far and we will see where to continue.


Look the next link to other thread: Unsupervised Learning : Week3 : Learning the state-value function - #3 by Luis.BR

Hello everyone,

I have a question regarding this and don’t want to start a new thread.
My question basically is, how the 10000 training examples relate to each other. Are they consecutive steps? So that s(1)’ = s(2)? If so, what does the lander do if it ends up in a terminal state? If not, does it start from a totally random state for each initially generated training example? Do we limit the possible range of s somehow, so that it cannot start too far away from the landing zone or below it or something like that?

Best regards


I have the same question as well!

It is your decision, and your approach. They can be consecutive, or they can be not.

Then you probably can’t collect more samples afterwards, because it is terminated.

We want to make sure the samples are relevant to what we want to predict for.

Btw, starting a new thread sometimes helps your reader.


Hello Raymond,

thanks for your reply.
I will start a new thread next time, but because someone else mentioned, that they have the same problem, I will continue here.

I am stuck in video “Learning the state-value function” right now and I am trying to understand what Mr. Ng is doing. You say “It is your decision, and your approach. They can be consecutive, or they can be not.”, but I am not at that point, yet.
My question is, “What is Mr. Ng doing there ?”.

Best regards


Where is there? Can you share the source (e.g. video name + time mark) so that I can take a look?

Of course, the video is:
and what I am confused about is the process starting at about 6:00 and going on until 15:00.

Thanks for your quick reply.

So, he only said to store 10,000 most recent tuples, which means very likely some of them are consecutive but if the 10,000 tuples involve multiple episodes, then you may also find some “discontinued” points among the tuples. For example, you can have 2,000 consecutive tuples from episode one, then 1,500 consecutive tuples from episode two. Obviously, we can’t expect the last tuple from the first episode and the first tuple from the second episode to be consecutive.


Then he said to create a training set of 10,000 examples which obviously use all of the buffered tuples.

However, this is just one way to do it. You can actually store 10,000 tuples but only sample 2,000 of them for your training, as long as you prove that it is more beneficial to do so - now this makes it your approach and your decision.

Once you sample, it does not really matter whether they are consecutive or not. Actually, when training a neural network, it does not in any way recognize the “consecutiveness” of the samples because each sample is treated independently, so why care the consecutiveness?

I actually would suggest you to google for more discussions about the “consecutiveness” on whether it is always a good practice to follow, or there may be some down side.

Ok, that helps me alot. I do not really care about the consecutiveness, I just needed to understand if its important or not.

Let me explain my understanding of the process and you tell me if I am mistaken at some point, ok?

So we have a model Q which we initiated randomly. I assume there are some restrictions given the fact that we now something about the lunar lander environment, but basically its random.
Now we create a bunch of training examples x comprising of a random state and a random action.
We then calculate Q(s,a) based on that model and a given training example.

Now we have learned some actual information from the “real world” which is R(s), an actual reward, that we got from being in state s. This actual real world information is then combined with the gamma * max(a’)Q(s’ ,a’) to calculate y.

As far as I understand the second part of that sum is completely made up because it depends on the random model Q, except for the fact, that we use the actual state s’ as input, that we got to by taking action a in state s.

Now that we calculated all this, we have something to learn from because there is a difference between Q(s,a) and R(s) + gamma * max(a’)Q(s’ ,a’).

The reason for that difference is, that if we had simply put x, i.e. (s,a), into Q, Q(s,a) had assumed some value for R(s) that is totally random, based on its initialization. But if we actually take an action a, we get a value R(s) from the real world, that is most likely different from the value R(s), that Q would have assumed on its own.

This difference is our information gain on which the whole learning process builds up, because we can now compare y the output of the model including the true value R(s) with yhat, the output of Q without knowledge about the true R(s).

Am I mistaken somewhere?

Hello @Fabian_Harder

“we now something” → “we know something” ?

I can’t imagine how we initialize the model given any knowledge about the env. If we know something, we can initialize it randomly and then modify it later on. By restriction, I believe you are referring to some constraints that are going to apply throughout the training process, such as something like “all weights have to be positive at all time”. If that is the kind of restriction in your mind, then I can hardly imagine any case at this moment why we would want it. Maybe in some cases we would want it, but right now, at this moment, I just can’t think of any reason for that. So, I have reservation here.

Yes. Random is the usual starting point.

Well, I would not say “random action”. It is not that you are wrong because we need random action, but this is NOT all we need.

If you go thru the lectures, we have the so-called “epsilon-greedy policy” which lets the action be sometimes random and sometimes based on the Q function.

You may think that the Q function is nothing better than random because it is initialized randomly, but it is only the case at the beginning. After some time, the Q function will start to be better than random, so at that point, we have sometimes random actions and sometimes better-than-random actions.

(Now that i have finished reading your whole post, I think you meant to just focus on the very initial stage of Q when it is nothing better than random, while my above response is more throughout the whole training process. Therefore, it is fairer for you that I take back everything I said here, but I choose to leave it here just for your reference)

Very good point.

Yes! Yes!

No. :wink:


1 Like

Amazing, I have no questions left, thank you.

You are welcome, @Fabian_Harder :wink: