**About approximating Q with a neural network**

I was asking myself: why would an estimate of y produced by a random Q function, after we take an action, be better than just a simple guess?

1- Random y

2- y = R(s) + max over a' of Q(s’, a’)

Why would **2** give a result closer to the true y than **1** after the first step, while Q is still random?

Assuming that it does (because RL works), the only information added in **2** is R(s), s’, a’, and Q(s’, a’).

I suspect both Q(s’, a’) and R(s) add new information that improves y, because they appear in the equation. But I really can’t see how.
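To make the question concrete, here is a minimal sketch of the target computation I have in mind (the tiny state/action space, the value range, and the discount factor gamma are my own illustrative choices; I left gamma out of the formula above):

```python
import random

rng = random.Random(0)

# Hypothetical tiny problem: 3 states, 2 actions, randomly initialized Q.
states, actions = range(3), range(2)
q = {(s, a): rng.uniform(-1, 1) for s in states for a in actions}

def td_target(reward, s_next, gamma=0.9):
    """y = R(s) + gamma * max over a' of Q(s', a'), with Q still random."""
    return reward + gamma * max(q[(s_next, a)] for a in actions)

y = td_target(reward=-10, s_next=1)
```

At this first step, y is just the reward plus a discounted random number.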

**First, about Q:**

How can a random function add information? Even though the inputs s’ and a’ are not random, Q itself is.

If I give a meaningful seed to a Python random number generator, the output still looks random to me if I know nothing about how the generator works, which is exactly the case with a freshly initialized random Q.
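The analogy I mean, as a sketch (seeding with 40 is my own arbitrary choice):

```python
import random

# Seed the generator with a "meaningful" number.
a = random.Random(40).randint(1, 100)
b = random.Random(40).randint(1, 100)

print(a == b)  # True: deterministic given the seed.
# But without knowing the generator's internals, the output tells me
# nothing about the seed, just like a freshly initialized Q's outputs
# tell me nothing about the true value of (s', a').
```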

So the only possible conclusions I see are:

- I’m missing something, and the output of a random Q is not actually random given its input; or

- we could substitute Q with any random number in the first step of the algorithm and nothing would change. Nothing would be lost at that first step, so the information that improves y must be coming from R(s).
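If that second conclusion is right, then at the first step the two targets below should be equally (un)informative; a sketch with made-up numbers:

```python
import random

rng = random.Random(0)
# Randomly initialized Q over a hypothetical tiny state/action space.
q = {(s, a): rng.uniform(-1, 1) for s in range(3) for a in range(2)}

r, s_next = 5.0, 1

# First-step target bootstrapped from the random Q...
y_bootstrap = r + max(q[(s_next, a)] for a in range(2))
# ...versus the same reward plus an arbitrary draw from the same range.
y_arbitrary = r + rng.uniform(-1, 1)

# Both are "R(s) plus an uninformative term": the usable signal is in r.
```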

**About R(s):**

If R(s) adds information, there is a new question:

In what situations does adding a number to a random number make it closer to a given target?

Suppose I had to pick a natural number from 1 to 100 at random, trying to hit a target: 40.

If someone gives me a tip, “the target number is greater than 20”, my chances of success improve from 1/100 to 1/80.

In this way, I get how R(s) could improve our estimation of y.

But this can only work if I somehow know with certainty that the value of y is bounded between 1 and 100.

(I have already assumed the output of Q is bounded between 1 and 100.)

If that’s not the case, someone telling me R(s) = 20 would only shift my guessing range from 1-100 to 21-120, not improving my chances of hitting 40.
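A quick simulation of the guessing game above (the trial count and seed are arbitrary):

```python
import random

random.seed(0)
target, trials = 40, 100_000

# No tip: uniform guess over 1..100 (expected hit rate 1/100).
hits_plain = sum(random.randint(1, 100) == target for _ in range(trials))

# Tip "target > 20" combined with a KNOWN upper bound of 100:
# guess over 21..100 (expected hit rate 1/80).
hits_tip = sum(random.randint(21, 100) == target for _ in range(trials))

# The "tip" applied as a pure shift, with no known bound: the guess range
# just slides to 21..120, same width, so the hit rate stays 1/100.
hits_shift = sum(random.randint(1, 100) + 20 == target for _ in range(trials))
```

With the bound known, the tip narrows the range and helps; as a pure shift, it doesn't.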

So, to recapitulate: I don’t see how a random Q would help.

As for R(s), I believe it only helps if y is bounded.

And there is a harder question about R(s).

For this to work, we are assuming that R(s) somehow ‘knows’ about the correct y.

In our example, the tipper can only give a good tip because it knows the location of the target.

But in a Markov process, our current state’s reward has no relation to the final return obtained by following the best path. It doesn’t even know the other states exist.

We could get R(s) = -10 in our current state because we were dealt a bad hand, or started with our rocket upside down, yet after taking action a, our y could be +20 if we follow the best path and play well, or land the rocket successfully.

In the above example, R(s) would make our estimated y worse: it would increase the distance between any random Q output and the correct y = +20 by subtracting 10 from it.
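In numbers, the rocket case I mean (the reward values are made up):

```python
# Hypothetical episode: the current state's reward is bad (upside-down
# rocket), but the return after acting well from here is good.
r_current = -10        # R(s): the term the target adds in
future_rewards = [30]  # what actually comes from following the best path

y_true = r_current + sum(future_rewards)  # undiscounted return = +20

# The target R(s) + Q(s', a') starts from -10 plus a random number,
# while the correct y is +20.
```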

These were some of the ways I tried to understand how Q-learning works, and I failed.

As you can see, I really have no idea, which is why I’m asking for help.

Thanks!