Where does the information to improve Q come from?

About approximating Q with a neural network,
I was asking myself why an estimate of ‘y’ using a random Q function, after we take an action, would be better than just a simple guess?

1- Random y

2- y = R(s) + best Q(s’, a’)

Why would 2 give a result closer to the real y than 1, after the first step, where Q is still random?

Assuming that it does (because RL works), the only information added in 2 is R(s), s’, a’, and Q(s’, a’).

I suspect both Q(s’, a’) and R(s) add new information that improves y, because they are in the equation. But I really can’t imagine how.

First, about Q:
How can a random function add information? Even though the inputs s’ and a’ are not random, Q is.
If I give a meaningful seed to a python random number generator, the output is still random if I know nothing about how the generator works, as is the case with the first initialized random Q.

So the only possible conclusions I see are:

  • I’m missing something and the output of a random Q is not always random depending on the input
  • We could substitute Q for any random number in the first step of the algorithm and it wouldn’t change anything, nothing would be lost at that first step. So information to improve y would be coming from R(s).
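To make the bullets concrete: a randomly initialized Q is at least deterministic, in the sense that the same input always gives the same output, even though its values carry no information about the environment’s rewards. A toy sketch (the linear ‘Q’ and its features are made up just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=4)  # randomly initialized weights of a toy linear "Q"

def q(s, a):
    """A toy randomly initialized Q: a fixed linear function of (s, a)."""
    x = np.array([s, a, s * a, 1.0])
    return float(W @ x)

# Deterministic: the same (s, a) always maps to the same value.
assert q(2.0, 1.0) == q(2.0, 1.0)

# But the value itself depends only on the random weights W,
# not on any reward structure of the environment.
print(q(2.0, 1.0))
```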

About R(s):
If R(s) adds information, there is a new question:
In what situations adding a number to a random number makes it closer to a given target?

Suppose I had to pick a natural number from 1 - 100 randomly to try to hit a target: 40.

If someone gives me a tip: “the target number is greater than 20”, my chances of success are now bigger going from 1/100 to 1/80.

In this way, I get how R(s) could improve our estimation of y.

But this can only work if I somehow know with certainty that the value of y is bounded between 1 and 100.

(I already assumed the output of Q is bounded between 1 and 100.)

If that’s not the case, someone saying R(s) = 20 would only shift my guesses from 1 - 100 to 21 - 120, not improving my chances of hitting 40.
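A quick Monte Carlo check of the tip example above (same numbers as in the example; the simulation is just for illustration):

```python
import random

random.seed(0)
TARGET, N = 40, 200_000

# Without the tip: guess uniformly from 1..100.
hits_no_tip = sum(random.randint(1, 100) == TARGET for _ in range(N))

# With the tip "the target is greater than 20": guess uniformly from 21..100.
hits_tip = sum(random.randint(21, 100) == TARGET for _ in range(N))

print(hits_no_tip / N)  # close to 1/100 = 0.0100
print(hits_tip / N)     # close to 1/80  = 0.0125
```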

So to recapitulate, I don’t know how a random Q would help.
About R(s), I believe it only helps if y is bounded.

And a more difficult question for R(s).
For this to work, we are supposing R(s) somehow ‘knows’ about the correct y.
In our example, the tipper is only able to give a good tip if it knows the location of the target.

But in a Markov process, our current state’s reward has no relation to the final return after following the best path. It doesn’t even know the other states exist.

We could get a R(s) of -10 from our current state because we were given a bad hand, or started with our rocket upside down, but after taking action a, our y could be + 20 if we follow the best path and play well or land the rocket successfully.

In the above example, R(s) would contribute to making our estimated y worse. It would increase the distance between any random Q output and the correct y = 20 by subtracting 10 from it.

These were some of the ways I tried to understand how Q-learning works, but I failed.
As you can see above, I really have no idea; that’s why I’m asking for help.



Hello Douglas,

Reward is the source of information.

No, it can’t. Reward is the source of information.


It does so when it is no longer purely random. Initially it is purely random, but as we add information to the model, it becomes less random. We add information from the reward by updating the model with gradient descent.

Random Q won’t help.

Please don’t limit your understanding of RL to the example you gave. Your example is a good one for explaining how additional information can bring improvement, but not everything works that way.

Shall we take a step back and think again about how neural network training works, and how linear regression training works? These are topics covered in Courses 1 & 2, and I assume you have gone through them by taking the courses, or by other means.

A neural network or a linear regression model begins with randomized parameters, agree?
Then we add information. The information is the labels, agree?
The neural network will become better and better over the training progress because we keep adding information to it, agree?
Each time we add information, the NN becomes a bit better, and at the same time, less “random”, agree?

The similar thing between RL and Linear regression training is that, they both start with randomized parameters, agree?
They both start from being poor, then somehow become good at the end, agree? (as you said, RL works)
The difference we may want to focus on is that we don’t use a poor linear regression model to make predictions for real-world purposes; instead, we continuously train it until it becomes good enough to serve. However, for RL, we use a poor RL model to decide what the robot does next. This is the difference, agree?

However, there is another equally important difference between them: we have a good set of training data ready for the linear regression model, but we don’t have such a dataset for RL. Agree?
How can we get data for the RL robot? By taking action. To be honest, any action, be it a good action, a bad action, or a randomly picked action. No action, no data. What choice do we have?

With action, we have data, and with data, we train the model with the data. If a linear regression model can improve with training data, and if a neural network can improve with training data, why can’t a neural network that’s used for RL be improved with training data? Why can’t it?



Hey @dncjorge,

I just want to add a little bit to Raymond’s detailed explanation here. Raymond has covered very well how Q will learn over time, and hence how the probability of producing random outcomes will decrease over time.

As for the second point, yes, as long as you are following the same distribution for sampling random numbers, you can put in any random number, but, as you said, only in the first step, since from the second step onwards Q can produce learned outcomes as well. Now, you may wonder why I have highlighted the “same distribution” here. This is because, if you put in large random numbers as initial values, your rewards and updates might not be able to reverse those over-optimistic or over-pessimistic initial values.

Let me tell you my strategy when I train any RL agent. If the rewards are 0 for all the states except the terminal state (assuming an episodic task here) or a few states including the terminal state, then I initialize the values as small random values. If the rewards are all positive, then I sample the initial values from a uniform distribution between 0 and 1, and if the rewards are both positive and negative, then I sample the initial values from a normal distribution with a mean of 0 and variance of 1. Now, if the rewards are present for a majority of states, then you will find that even 0 initialization will work just fine.
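In code, one way to express this strategy could look like the following sketch (the function name and profile labels are made up for illustration):

```python
import numpy as np

def init_values(n_states, reward_profile, rng=None):
    """Initialize a value table following the strategy described above."""
    rng = rng or np.random.default_rng(0)
    if reward_profile == "sparse":        # rewards are 0 almost everywhere
        return rng.normal(0.0, 0.01, size=n_states)  # small random values
    if reward_profile == "all_positive":  # all rewards are positive
        return rng.uniform(0.0, 1.0, size=n_states)  # uniform in [0, 1)
    if reward_profile == "mixed":         # both positive and negative rewards
        return rng.normal(0.0, 1.0, size=n_states)   # mean 0, variance 1
    if reward_profile == "dense":         # rewards present for most states
        return np.zeros(n_states)                    # 0-init works just fine
    raise ValueError(f"unknown reward profile: {reward_profile}")
```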

I hope this helps.



Hi Raymond, thanks again for your help!
Yes, I get the general idea.
But I want to know the specific mechanism I’ve mentioned.

The equation is exactly like that in Andrew’s slides and in the python implementation.

The process to calculate each component of the y vector is that sum expressed in the equation.
There’s no way to avoid the question of how adding R(s) improves our estimation of y.

And, component-wise, the components of the equation are just simple numbers.

I’ve just thought of a possible way R(s) can help.

I would appreciate if you could tell me what you think about the hypothesis below.

We want to know the result of a chain of n sums (the return).
If I knew the n components, I would know the full sum.
If I knew (n-1) components, I would be less certain about the result, but assuming that the last unknown component has a value inside a limited range that we can estimate, I would still be able to make a good guess of the final sum.
Continuing this line of thought, if I knew only one component of the sum, then, given that the length of the chain is finite and all the other components are in a reasonably well-known range, I would still objectively gain new information that decreases the doubt I initially had about the final sum’s value.

This information is very small, but real.

One positive point is the fact that the discount factor gets exponentiated along the chain of states, making the contribution of each successive chain node to the final sum smaller and smaller.

So, of all the states (from our current one to the final one in a path) that we could pick, the one that would give us the maximum amount of information about the best path’s return is the current one, because its value has the same theoretical range as the others, but has the advantage of not being multiplied by the discount factor.

I believe what I said in my original post is true: a single state can actually move our prior estimate of the return value in the wrong direction, but that’s probably a rare event.

Most of the time, if our return is a large positive number and I pick a random component of the sum that constitutes it, chances are higher that I would pick a component with a positive value.
There may be some components with negative rewards in the sum, but most of the time in smaller quantities; otherwise the final return would tend to zero or even turn negative.

The exception would be when few very large positive rewards outweigh many negative ones. But this is probably rare.

I can’t think of why right now, but it’s what my intuition says: given a positive sum, it’s more probable that it results from the sum of many positive values than from many negative values plus some very large positive ones.
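A quick simulation of this intuition, assuming i.i.d. gaussian components (all the numbers here are made up just to test the idea):

```python
import numpy as np

rng = np.random.default_rng(0)
n_chains, length = 50_000, 10

# Many chains of i.i.d. standard-normal "rewards".
rewards = rng.normal(size=(n_chains, length))

# Keep only the chains whose total sum (the "return") is positive.
positive = rewards[rewards.sum(axis=1) > 0]

# Pick one component at random from each surviving chain.
idx = rng.integers(0, length, size=len(positive))
picks = positive[np.arange(len(positive)), idx]

# Given a positive sum, a randomly picked component is positive
# more often than not.
print((picks > 0).mean())  # noticeably above 0.5
```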

(This problem reminds me of how the Allies estimated the total number of German tanks based on the serial numbers they saw on some of the captured ones.
I know it’s not the same problem, because in one you want to know the sum of the numbers and in the other a range, but both involve some inference and Bayesian thinking.)

Also thanks for the clarification about the random function not adding information.

Hello Douglas,

Let me ask some clarifying questions:

Can you share a screenshot of the slide which contains the equation? I am not sure about which equation you are referring to. You may find the slides here.

I have a feeling that you are speaking about the Bellman equation which is the sum of a series of rewards discounted by gammas.

However, in the first post of this thread, I suppose you are speaking about the Q-network. Although Q-network and the Bellman equation both speak about the Q-values, they are not equivalent. Which one should we focus on now?

Are you referring to the information as the “only one component of the sum”?

current one = reward in the current state?

So is this your hypothesis? Is this what you want to talk about in this post? And how does it relate to Reinforcement learning?

For me, just looking at this statement, not in the context of RL, I would agree that if the sum of a series of numbers is positive, the chance is higher for me to pick a positive number from the series, provided the numbers are gaussian distributed.

Let me know 🙂


Hello Douglas @dncjorge ,

On second thought, it seems to me you are trying to argue how rewards add info to the neural network by discussing how the reward at the current state is meaningful to the Bellman equation.

I want to raise the point that a Q-network is not the same as the Bellman equation, even though they are both about estimating some useful Q-values for decision making.

The Bellman equation is a way to tell us the expected return.

However, a Q-function doesn’t tell you the same thing. We are only trying to optimize the QN in the hope that it will produce the y as defined below. Moreover, this is not the only way you can define y, and any useful way is possible.

(screenshot of the lecture slide showing the learning algorithm that defines y)

I hope you see that in the above algorithm, nothing like “then behave optimally after that” ever shows up. So, it is not the same as the Bellman equation.


To convince myself that R adds information, it would be sufficient to know that my QN learns these x-y samples. Put it this way: compared with the initial randomized QN, if, after some training process, the QN accepts x and produces exactly y, then I am convinced that my QN has received the information from the training samples pretty well. Of course, my QN won’t produce exactly y, but y with some errors. However, the errors should reduce over more training steps, because more and more information is added to my QN.

So I would say: R presents itself in the training data, and adds information to the QN through gradient descent, the optimization algorithm. This is how I would approach the titled question directly, instead of going through the Bellman equation, which is actually pretty different from the Q-network.
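A minimal sketch of that claim, with everything (the features, the made-up state-dependent reward, the tiny linear “QN”) invented for illustration: the targets y contain R, and plain gradient descent on the squared error demonstrably pulls that information into the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: x encodes (state, action); the targets contain the reward.
X = rng.normal(size=(64, 3))
R = np.where(X[:, 0] > 0, 1.0, -1.0)           # made-up state-dependent reward
y = R + 0.95 * rng.normal(scale=0.1, size=64)  # y = R + gamma * (stale bootstrap term)

w = rng.normal(size=3)                         # randomly initialized "QN"
mse = lambda w: float(np.mean((X @ w - y) ** 2))

initial_error = mse(w)
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)          # gradient of the mean squared error
    w -= 0.1 * grad
final_error = mse(w)

# The error shrank: R's information entered the weights via gradient descent.
print(final_error < initial_error)  # True
```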


Hi Raymond!

Can you share a screenshot of the slide which contains the equation?

Yes, I’m referring to Bellman Equation

Although Q-network and the Bellman equation both speak about the Q-values, they are not equivalent. Which one should we focus on now?

I’m asking about the interaction of the two.
When training the neural network that will approximate Q, we use Bellman Equation to generate the estimations for y.

You are referring the information to as the “only one component of the sum”?

Yes, I’m trying to say that the return Q(s, a), after being in state s, taking action a, and then behaving optimally after that, is what the Bellman equation tells us it is:

Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')

This is ultimately a sum of many components:

Q(s, a) = R(s) + \gamma R(s') + \gamma^2 R(s'') + \gamma^3 R(s''') + \dots + \gamma^n R(s^{(n)})

I’m saying that we need to estimate the value of this whole expression above.
First we start with a random guess. Then, after one iteration, we should have a better guess consisting of R(s) plus an estimation of
\gamma \max_{a'} Q(s', a') = \gamma R(s') + \gamma^2 R(s'') + \gamma^3 R(s''') + \dots + \gamma^n R(s^{(n)})

So, by ‘‘one component of the sum’’ I refer to R(s).

current one = reward in the current state?

Yes, R(s)
If we can assume all the other rewards R(s'), R(s'')... have the same range of possible values with a similar probability distribution, then the first component of the sum, R(s), will on average be the biggest factor in the sum most of the time, because it’s not multiplied by the discount factor.
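A quick numeric check of how much weight the first, undiscounted term carries (assuming, for illustration, a chain of 10 identically distributed rewards):

```python
# Fraction of the total discount weight carried by the first term R(s),
# for a chain of 10 rewards.
for gamma in (0.5, 0.9, 0.99):
    weights = [gamma ** k for k in range(10)]
    share = weights[0] / sum(weights)
    print(gamma, round(share, 3))
# The first term always carries the single largest weight, but for gamma
# close to 1 the later terms remain comparable.
```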

So is this your hypothesis? Is this what you want to talk about in this post?

Yes, this is my hypothesis.
What I want to talk about is whether my hypothesis is a good explanation for how exactly knowing a single component (R(s)) of a sum, plus a random estimation of the rest of the components (\max_{a'}Q(s', a')), is better than just guessing.

You explained to me that calculating \max_{a'}Q(s', a') with a random Q helps with nothing, so the only other possibility was to investigate how R(s) allows us to improve our guess of Q(s, a).

And how does it relate to Reinforcement learning?

The relation occurs when we need to map a value of Q(s, a) for every given state s and action a, without the need to explore every possible s and a.
To estimate Q(s, a), we use a neural network.
But to train that neural network and calculate its errors, we would need the real value of Q(s, a) for every s and a, which we call y.
(edited above to clarify that by Q(s,a) I refer to its output value and not the function Q)
We don’t have those y, because we don’t have labeled data.

So the proposed solution is to start a loop of improvement.
Start with random y and a random estimation of Q, then:

1- use Bellman Equation, R(s) and our guess of Q to estimate a better y
2- use our better estimation of y to generate a better estimation of Q by:

training QN some number of times (using y)
transferring some of that training to TQN with a soft update
using TQN to estimate a better y
(go back to 1)

My question is related to step 1.
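To pin the loop down, here is a minimal sketch of steps 1 and 2 using a toy linear “network” (all the features, constants, and environment numbers are made up; a real QN would be a neural network trained on batches of experience):

```python
import numpy as np

rng = np.random.default_rng(0)

GAMMA, LR, TAU = 0.95, 0.01, 0.05
ACTIONS = [0.0, 1.0]

def features(s, a):
    return np.array([s, a, s * a, 1.0])

w_qn = rng.normal(size=4)   # QN weights: random at the start
w_tqn = w_qn.copy()         # target network (TQN) weights

def q(w, s, a):
    return float(w @ features(s, a))

def target(r, s_next):
    # Step 1: use the Bellman form, R(s), and our current guess of Q (via TQN).
    return r + GAMMA * max(q(w_tqn, s_next, ap) for ap in ACTIONS)

def train_step(s, a, r, s_next):
    global w_qn, w_tqn
    y = target(r, s_next)
    # Step 2a: one gradient step pulling QN's prediction toward y.
    err = q(w_qn, s, a) - y
    w_qn -= LR * err * features(s, a)
    # Step 2b: soft update transfers a little of that training to TQN.
    w_tqn = TAU * w_qn + (1 - TAU) * w_tqn

# Repeating the loop on one fixed transition: the gap between QN's
# prediction and the (moving) target y shrinks over iterations.
err0 = abs(q(w_qn, 1.0, 0.0) - target(1.0, 1.0))
for _ in range(500):
    train_step(1.0, 0.0, 1.0, 1.0)
err1 = abs(q(w_qn, 1.0, 0.0) - target(1.0, 1.0))
print(err1 < err0)  # True
```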

Thanks for your time and help.


Hi Raymond!

I hope you see that in the above algorithm, nothing like “then behave optimally after that” ever shows up.

When I say “behave optimally after that” I refer to the term \gamma*max_{a'}Q(s', a'), so that we pick the best actions possible in every state that follows to guarantee the maximum return.

To convince myself that R adds information, it would be sufficient to know that my QN learns these x - y samples.

Yes, I’m convinced about that fact. I just want to know the mechanism of how that can happen simply by knowing R(s) in every iteration.

instead of going through the bellman equation which is actually pretty different from Q-network

Yes, thank you for clarifying the difference! But the thing is, the training of the neural network has a crucial step that uses the Bellman equation; that’s why I used it in my question.

thank you again for your help!


Hi @Elemento , thank you for the help!

Just to clarify, when you say:

If the rewards are 0 for all the states except the terminal state (assuming an episodic task here ) or a few states including the terminal state, then I initialize the values as small random values.

By ‘the values’ you refer to the initial guess of y right?

If this is the case, it makes perfect sense to me.

As the improvements to y are very small on every iteration, if your initial estimated y values are very far from the real y, they will converge extremely slowly toward the correct values.

So, to avoid that, you make suppositions about how to pick a good initial guess of y via the sampling method you described.

Did I understand correctly?



Hello Douglas,

Thanks for your explanations.

First, I agree with most of the below, except I believe you meant to say “using TQN to estimate a better Q” in the second-to-last line, because that echoes your “to generate a better estimation of Q by”.

I also agree that

except that it is not because one R(s) component is representative. It is not representative because, first, in principle \gamma can be very close to 1, such that \gamma^4 is still comparable with \gamma (e.g. 0.99^4 = 0.96). Second, no matter what distribution R(s) follows, we can almost never say that one sample of that distribution is representative.

However, I am not overthrowing your idea; indeed, I agree with the above two quotes very much, and a “representative R(s)” is not necessary to support them. Your two quotes alone are sufficient.

I think the above 2 quotes of your posts should be sufficient?


Yes, thanks for the correction

I think the above 2 quotes of your posts should be sufficient?

Yes, asking the questions and dialoguing about them forced me to systematize my doubts and understanding.

I believe I am now finally comfortable with my intuition of how all of this works.

Thank you for your patience and clarifications Raymond!


You are welcome Douglas! I enjoyed discussing with you and I think we have agreed on a lot of things!


Hi all,

Thanks Douglas for starting this thread. I have the same exact question as Douglas, but unfortunately I’m still not really convinced of how the algorithm works after going through the conversation here.

Pardon me if I’m repeating the questions here:

In the above snapshot, the value of y itself is a guess (maybe not purely random, because of R(s)), so what’s the point of training Q_new to approximate y?

Also, I learn by examples… Let’s take the discrete space Mars Rover example in the below slide:

Can someone please outline the steps taken by the learning algorithm with numerical values of initial guesses?

An optional lab with the visualization of this process would definitely be very useful for future takers of this course!

Thanks a lot,

Hello @Pavan_Kumar_R,

Let me not go too far first.

Nowhere else is a true y available for the training. Nothing is better than that y, which is better than purely random because of R. Agree?

Let’s set aside the fact that it indeed does work (at least the assignment shows it can work). If y can be improved by such training from purely random to not purely random, can’t further training make further improvements?


Hi Raymond,

Thanks for your response. I agree that R adds more information to y in every iteration. I wish I could see the internal steps of the algorithm working to convince myself better. Let me try to work it out.


Yes, please. Then we can try to develop something from your words.
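As a starting sketch, here is tabular Q-value iteration on a 6-state rover. I am assuming the usual lecture setup (states 1..6, terminal rewards 100 on the left end and 40 on the right, zero reward elsewhere, γ = 0.5); adjust the numbers if your slide differs. Starting from an all-zero guess, repeatedly applying y = R(s) + γ max Q(s', a') makes the table converge:

```python
# States are indexed 0..5 (1..6 in the slide); 0 and 5 are terminal.
R = [100, 0, 0, 0, 0, 40]
GAMMA = 0.5
TERMINAL = {0, 5}
ACTIONS = {"left": -1, "right": +1}

# Initial guess: 0 everywhere (it could just as well be random).
Q = {s: {a: 0.0 for a in ACTIONS} for s in range(6)}

for sweep in range(10):
    newQ = {s: dict(Q[s]) for s in Q}
    for s in range(6):
        for a, step in ACTIONS.items():
            if s in TERMINAL:
                newQ[s][a] = R[s]  # episode ends here: y = R(s)
            else:
                s2 = s + step      # deterministic move left/right
                newQ[s][a] = R[s] + GAMMA * max(Q[s2].values())  # Bellman target
    Q = newQ

for s in range(6):
    print(s + 1, Q[s]["left"], Q[s]["right"])
```

In the first sweep every non-terminal entry is still 0, which is exactly the "y is still mostly a guess" situation discussed above; after about six sweeps the table stops changing, e.g. state 2 ends with Q = 50 going left and 12.5 going right.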


Hi Douglas,
I have the same problem understanding why using a single observation (R) makes the estimate (the output of TQN) better. Isn’t it possible that, having observed a good value of R, on subsequent training episodes we observe a smaller value of R and replace our earlier learning about the state’s overall return with the new, lower value?
I think it isn’t possible without some sense of “don’t replace the state’s learnt value if the new value you saw is lower”, but there’s no such thing in the algorithm.
I don’t see how the NN will steadily get better rather than keep oscillating based on what it recently observed.