Learning the Q Function

I got the intuition as to how the Q function gets better over time. Initially, the agent takes random actions. It receives feedback in the form of rewards from the environment. The NN is trained and next time it does better and the time after that better still. But I can’t seem to grasp some of the details. For example, assume there are a total of 100 states (discrete states, for simplicity). Now assume after some training the agent takes the optimal action in 50 states and a suboptimal action in 50 states. Say epsilon is 20%. So the agent listens to the Q function’s suggestion 80% of the time and takes a random action 20% of the time. How would the Q function improve after this point with further training?

Let’s say using the above Q function you collect 10 new training examples. The Q-function will give you, say, 5 optimal values of Q and 5 suboptimal ones. So, the action suggested by the Q function for 5 datapoints is a good one. For the other 5 it’s bad. If the agent listens to the Q function in all the bad scenarios and does the 2 exploratory steps where following the policy would have been good, wouldn’t the subsequent Q function get worse?

You could argue that next time maybe the opposite will happen where it listens to the Q function when it is “supposed” to and explores when appropriate (by luck of course, the agent obviously does not know with certainty which is the best action). But over time, how will the Q function ever converge? It could just flip flop as in the case I’ve described, never to converge. What I’m trying to say is that we could just flip flop between learning the “correct” action and then undoing it in the exploratory step. While the exploratory step helps us find better actions if we don’t have the best one already, it can also undo the best action we already have.

I’m really confused about this! Any help would be much appreciated. :slight_smile:

Hi @Aditya_Ranganath

First thing to understand is that optimal actions are not known - the agent does not know which 50 steps are optimal or not, or even how many steps were optimal in the first place. In general, if the reward scheme is won=1/lost=-1, then all the steps in the episode are rewarded/penalized at the end.

So in your example the agent picks the best action 80% of the time (even in the beginning when probabilities of the actions are initialized, it “thinks” it’s picking the best action but since probabilities are random at initialization those “best” actions are random).

As the training continues, the Q function accumulates these “probabilities” (strictly speaking, estimates of the expected return) of which actions are better in certain states. The agent still picks the best one 80% of the time but explores all actions 20% of the time.

When the training has run long enough… for example 1 million episodes, the probabilities “converge” according to the law of large numbers. Here is a concrete, silly case: the agent received state [ 2 apples, 3 oranges ] 10 000 times, and of those times it won 9 000 times when it picked action “eat”. So the best action to pick would be “eat”.

To achieve better training there are strategies for epsilon decay, so that after some training you start to take fewer and fewer exploratory (possibly sub-optimal) actions. In the previous silly example you would start to take the optimal action (“eat”) 81% of the time, after even more training 82% of the time, and so on up to, let’s say, 99% at the very end (very few or no more exploratory steps).
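For anyone following along, here is a minimal sketch of epsilon-greedy action selection with epsilon decay. The function names, the decay rate of 0.995 and the floor of 0.01 are my own assumptions for illustration, not values from the course lab.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # exploit: index of the largest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
    """Shrink epsilon a little after each episode, but never below a floor."""
    return max(epsilon_min, epsilon * decay)


epsilon = 0.20  # start by exploring 20% of the time, as in the question
for episode in range(1000):
    # ... run one episode, choosing actions with epsilon_greedy(...) ...
    epsilon = decay_epsilon(epsilon)
```

With these assumed numbers, epsilon falls from 20% to the 1% floor over the 1000 episodes, so the agent follows the Q function more and more often as training goes on.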


Thanks @arvyzukai for your reply! With your explanation about the Q function accumulating probabilities over multiple episodes, I understand what’s going on. However, what I seem to be missing is how the probabilities are “preserved” from episode 1 to episode n. For example, say the agent chose to eat in the very first episode due to random initialization of the NN resulting in that action. The agent is trained on this data set and knows it should choose to eat if ever in that state again. Next time around, in episode 2, it chooses not to eat from the same state (explore step). Then the NN is trained on this data and in the 3rd episode the agent chooses not to eat (exploit step). My question is how the agent knows that it won in episode 1, where it chose to eat, and lost in the subsequent 2 episodes. Yes, it got a reward in that first episode, which may be the signal, but how does the agent “remember” this? When it is trained on new data in episode 2, where it did not eat, it has forgotten about the reward from episode 1.

Looking forward to your response!


Hi @Aditya_Ranganath

You can check this Wikipedia article which contains the table to build better intuition.

It always keeps track of when it won or lost in that state when it took that action. It does not distinguish episode 1 from episode 2. In other words, it accumulates the number of times it was in this state and took this action, together with the reward it received. For example, if it was in state [2 apples, 3 oranges] 100 times, out of these 100 it chose action “eat” 50 times, and it received an accumulated reward of 40, then the value of this action in this state is 40/50 = 0.8. If in the same state it chose action “not to eat” (the other 50 times) and received an accumulated reward of 15, then the value of the “not to eat” action in this state is 15/50 = 0.3. So when the agent is in this state next time (the 101st time) it can choose what to do accordingly.
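To make the arithmetic above concrete, here is a small sketch of such a table. It is a simplification: it only averages the rewards seen for each state-action pair, which is not the full Q-learning update, and the class name and the 1/0 reward scheme are assumptions for illustration.

```python
from collections import defaultdict


class AveragingTable:
    """Keeps a running total of reward and a visit count per (state, action)."""

    def __init__(self):
        self.total_reward = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, state, action, reward):
        key = (state, action)
        self.total_reward[key] += reward
        self.count[key] += 1

    def value(self, state, action):
        key = (state, action)
        visits = self.count.get(key, 0)
        if visits == 0:
            return 0.0  # nothing observed yet for this pair
        return self.total_reward[key] / visits

    def best_action(self, state, actions):
        return max(actions, key=lambda a: self.value(state, a))


table = AveragingTable()
state = (2, 3)  # 2 apples, 3 oranges

# 50 visits with "eat": accumulated reward 40 (assumed reward 1 or 0 per visit)
for i in range(50):
    table.update(state, "eat", 1.0 if i < 40 else 0.0)

# 50 visits with "not eat": accumulated reward 15
for i in range(50):
    table.update(state, "not eat", 1.0 if i < 15 else 0.0)

choice = table.best_action(state, ["eat", "not eat"])
```

The table never stores which episode a reward came from. Every visit just adds to the same two numbers per pair, which is why the order of episodes does not matter here.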

Hi @arvyzukai

Thanks for your reply! When we have a table where we keep track of all the previous state-action pairs and the accumulated rewards for each pair, I understand how it works now. However, in the case of a neural net in a continuous state space with infinitely many possible state-action pairs, we cannot keep track of everything the agent has done in the past. So we train a neural net every C time steps, let’s say. We have a limited memory buffer, so we cannot store all the (S,A,R,S’) tuples from the time the agent took its first ever action. So if we only have the agent’s last, say, 10,000 actions, how would it “remember” anything from before? Because we update the NN weights every C time steps, how would what the agent did 50,000 actions ago still influence the NN in the present? Doing soft updates and having a target Q network may help the NN not fluctuate too quickly too soon. But I still don’t see how the accumulated reward from the first ever action 50,000 steps ago is kept track of or accumulated.

Understanding this would be a big breakthrough for me because I feel I’m missing something obvious :sweat_smile:

I really appreciate your help with this.


Hello @Aditya_Ranganath

In simple words, the neural network, like any neural network, remembers a general “relation” between the inputs and outputs. It does not remember any particular action. It remembers the “relation”. For example, if my input is [1,2,3,4,5,6], and my output is [2,4,6,8,10,12], then what my neural network remembers is y=2x. It remembers neither my input nor my output, but y=2x - the relation.
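A rough illustration of the y=2x idea, using a single weight trained by gradient descent instead of a real neural network. The learning rate, step count and function name are assumptions chosen for this toy case.

```python
def fit_slope(xs, ys, lr=0.01, steps=500):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # derivative of mean((w*x - y)^2) with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w


xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]

w = fit_slope(xs, ys)

# The training data can be thrown away now: only w is kept,
# and it still gives a sensible answer for an input never seen before.
prediction_for_7 = w * 7
```

After training, the only thing stored is `w`, which ends up very close to 2. The individual samples are gone, but the relation they implied is still there.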

You said it can flip flop - that is true. It’s like if you ask the network to just learn how to eat: even if it had learnt how to drink previously, it could forget it. So when you train, make sure to always train the network with representative batches of samples that are not biased towards any particular scenario. This practice applies to any neural network.


Hi @rmwkwok

Thanks for your response! I see. The data trained on needs to be representative of all kinds of scenarios. This solves my issue. I guess in supervised learning I assumed this to be the case intuitively. But in RL since we can’t be fully sure if all scenarios are well represented due to the random nature of data collection I was confused. I suppose using an experience replay is one way of getting a representative data sample over time. Am I correct in assuming that the buffer size should be big enough to hold enough data such that it can over time have good data representation? So an RL problem with a diverse/large environment with a lot of things the agent can interact with would need a larger buffer size and an RL problem with fewer things going on needs a smaller one (I guess it’s obvious but just wanted to confirm!). So the buffer size we would need is directly proportional to the complexity of the RL problem.

Right, but the input from the action 50,000 steps ago is no longer in the input, correct? Since our buffer size is limited, only the most recent 10,000 inputs are what it is trained on, right?

Thanks again @arvyzukai and @rmwkwok for your time and inputs! Looking forward to your response.


Hi @Aditya_Ranganath


Random is better. Random sampling gives us a good hope of a fair representation of samples. The bad thing about RL is that we sample the world in a “sequential manner”, where the latest events in the sequence are often more correlated. For example, consider that I am now at home: all the samples I get are related to home activities, and if I get trained on those samples, I become good at living at home. Then I leave home and go to my office for 8 hours of work: all the samples I get during this period are related to work activities, and if I get trained only on these latest samples, I can become good at work but forget those home tricks.

See? We only appear in one place at a time, and all the samples we collect at any moment are related to where we are. This is a non-random nature. If we want to train ourselves well at work and at home, we need samples from both places, but we cannot keep going back and forth between the two places. What can we do?

Excellent point and answer to my above question!

Good point!

Agreed. Not just the buffer size, but also the neural network size.

If the buffer size is 10,000, then yes, the NN can only ever be trained on the latest 10,000 events.
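As a sketch of what a limited buffer does, assuming a simple first-in-first-out buffer with uniform random sampling (the class name and the dummy tuples are hypothetical, not taken from the lab):

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # a deque with maxlen silently drops the oldest item when full
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # uniform random mini-batch, which breaks up the sequential correlation
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=10_000)

# Dummy experience: the step index stands in for the state.
for step in range(50_000):
    buffer.add(step, 0, 0.0, step + 1)

oldest_step = min(s for s, _, _, _ in buffer.memory)
batch = buffer.sample(64)
```

After 50,000 steps the buffer holds only steps 40,000 to 49,999, so nothing older can appear in a mini-batch. Whatever the network learned from earlier steps survives only in its weights.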


Hi @rmwkwok

Makes sense! This was what I was confused about earlier. I get it now.

Ah yes, makes sense!

Gotcha, so we need a large enough buffer size to be able to have well represented data.

Appreciate your help @rmwkwok !


Dear Mr Raymond,


“We create a separate neural network, Target Q-Network. It is because the y target is changing on every iteration. Having a constantly moving target can lead to oscillations and instabilities. To avoid this, we can create a separate neural network for generating the y targets.”

Could you please guide me on how a constantly changing y target can lead to oscillations and instabilities?

I feel confused because after creating the training set, we start training a model to let the Q function learn to approximate the y targets. Once the training is done, we have an optimal parameter w for this round of iteration. Then we start the next iteration by collecting 10,000 training examples to create a new training set. Because of that, the 10,000 y targets change on every iteration.

It seems quite stable to me, with no oscillations occurring, but I am sure there must be something wrong in my understanding.

Thank you

Hi @JJaassoonn,

It seems to me you have done some experiments and made the observation that it’s quite stable. For the sake of discussion, I need to know more about your experiment. Can you share a brief summary of it? Please make sure to include key observations and experiment settings that are relevant to your question. I may ask additional questions based on your summary.


Dear Mr Raymond

I had the intuition after studying the lecture as shown in this figure.

Thank you

Hello @JJaassoonn,

I just wanted to be careful. From your last reply, I don’t see how you came up with the following conclusion:

Can you further elaborate how you come up with that? What makes you think that it seems stable and without oscillation?

Note that it is very very important that we are clear about this since I believe this is the center of the confusion: the lab said moving targets can lead to instabilities and oscillations, but you think otherwise.


Dear Mr Raymond,

I think that the procedures 1, 2, 3 as shown in the figure are all in a sequential manner, which is a stable step-by-step procedure for me.

“We create a separate neural network, Target Q-Network. It is because the y target is changing on every iteration. Having a constantly moving target can lead to oscillations and instabilities. To avoid this, we can create a separate neural network for generating the y targets.”

Yes, indeed.

I have no idea what this paragraph quoted from the lab material is trying to convey, especially the purpose of creating a Target Q-Network to solve the issue that a “constantly moving target can lead to oscillations and instabilities”.

Thank you.

Hello @JJaassoonn,

To begin with, I don’t think the procedure is convincing evidence that “constantly moving targets do not lead to oscillations and instabilities”. Until you have solid proof of that, I will just focus on how a constantly moving target may lead to oscillations and instabilities.

Let me quote from the lab and we will base our discussion on it:

  1. Generally speaking, the state space is so large that no single round of training will cover a representative set of samples for the whole state space.

  2. In other words, in each round of training, we are covering a sub space.

  3. Imagine that in training round number 1, the model is trained to learn about sub space A, and in training round number 2, the model is trained on sub space B. Obviously, the model will shift from a model for sub space A towards a model for sub space B

  4. Similarly, in subsequent rounds of training, it will move towards sub space C, D, E, and so on.

  5. However, the model is supposed to learn about the whole space. Instead, it is shifting around sub spaces A, B, C, D, and so on.

  6. Consequently, if we need the model to predict for something in the range of sub space A, the model might still work acceptably at the beginning, but as it moves away to other sub spaces, its performance there worsens.

  7. Therefore, we need a target Q network that is reluctant to shift between sub spaces (can you tell why it is reluctant to shift?). Because it is reluctant to shift, the target Q network will require the Q network to learn targets that are still reasonable for sub space A even when the bot is exploring sub space E. It provides a balance. The target Q-Network wants the Q-Network to work well in both sub space A and E even if the two spaces can be quite different.

My points 1 to 6 focus on an example of what may cause oscillation: the model’s coverage oscillates between different regions of the state space. As the coverage shifts, the model performs worse on some parts of the space.
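One way to see why the target network is reluctant to shift, assuming it is updated with the soft update mentioned earlier in the thread. The weights are plain lists of floats here, and tau=0.01 and the weight values are assumed for illustration only.

```python
def soft_update(target_weights, q_weights, tau=0.01):
    """Move each target weight a small fraction tau toward the Q-network weight."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(q_weights, target_weights)]


target_weights = [0.0, 0.0]

# Pretend a round of training on a new sub space moved the Q-network this far.
q_weights = [1.0, -1.0]

# One soft update: the target only moves 1% of the way.
one_step = soft_update(target_weights, q_weights)

# Even after 100 soft updates toward the same weights, the target still lags.
lagged = target_weights
for _ in range(100):
    lagged = soft_update(lagged, q_weights)
```

With these numbers, one update moves the target weights to about [0.01, -0.01], and after 100 updates they are still only about 63% of the way to the Q-network weights. The y targets produced by such a network change slowly, even when the Q-network itself is jumping between sub spaces.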


PS: Here is a screenshot of the original slide for the procedure:

Dear Mr Raymond,

Thank you so much for teaching me all this useful knowledge.

No problem @JJaassoonn!

As a learner myself, I often find it very helpful to organize my proof and the logical steps behind any statement I make, because this makes it easy for me and others to find where things can go wrong. After all, finding where things go wrong is our common goal. This is wanted and this is a good thing.

I also appreciate others who are willing to make statements because it shows the effort and provides a useful guide to how they consider the problem. They are our best and only guides to their minds on the question concerned.

Just to share some views. @JJaassoonn, good luck with your future learning journey :wink: