About approximating Q with a neural network,
I was asking myself: why would an estimate of ‘y’ using a random Q function after we take an action be better than just a simple guess?
1- Random y
2- y = R(s) + best Q(s’, a’)
Why would 2 give a result closer to the real y than 1 at the first steps, when Q is still random?
Assuming that it does (because RL works), the only information added in 2 is R(s), s’, a’, and Q(s’, a’).
I suspect that both Q(s’, a’) and R(s) add new information that improves y, since they both appear in the equation, but I really can’t see how.
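To make the two options concrete, this is roughly what I have in mind (just a toy sketch I wrote for this question; the random linear “network”, the featurization, and all the names are placeholders I made up, not anyone’s real implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=5)            # random, untrained "weights", fixed after this line

def q(state, action):
    """Stand-in for a randomly initialized Q network: a fixed random linear map."""
    features = np.append(state, action)   # toy featurization of (s, a)
    return float(features @ W)

def target_option_1():
    # Option 1: a completely random guess for y
    return float(rng.normal())

def target_option_2(reward, next_state, actions):
    # Option 2: y = R(s) + best Q(s', a') over the available actions a'
    return reward + max(q(next_state, a) for a in actions)

s_next = np.array([0.1, -0.3, 2.0, 0.7])   # made-up next state s'
print(target_option_1())
print(target_option_2(reward=-1.0, next_state=s_next, actions=[0, 1]))
```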
First, about Q:
How can a random function add information? Even though the inputs s’ and a’ are not random, Q is.
If I give a meaningful seed to a Python random number generator, the output is still random to me if I know nothing about how the generator works, as is the case with the freshly initialized random Q.
So the only possible conclusions I see are:
- I’m missing something, and the output of a random Q is not always random, depending on the input
or - We could substitute any random number for Q in the first step of the algorithm and nothing would change; nothing would be lost at that first step, so the information that improves y would have to come from R(s).
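Here is what I mean by the first initialized Q being “random” but fixed, as a toy example (again, the random linear map is just a stand-in I made up for an untrained network):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
W = rng.normal(size=4)                 # "randomly initialized" weights, frozen from here on

def q(state_action):
    # Once the weights are drawn, this is a fixed, deterministic map.
    return float(np.asarray(state_action) @ W)

x = [1.0, 0.5, -2.0, 3.0]              # some fixed (s', a') feature vector
print(q(x), q(x))                      # same number twice: the function is fixed

print(rng.normal(), rng.normal())      # a random number generator, by contrast, keeps changing
```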
About R(s):
If R(s) adds information, there is a new question:
In what situations does adding a number to a random number make it closer to a given target?
Suppose I had to pick a natural number from 1 to 100 at random to try to hit a target: 40.
If someone gives me a tip, “the target number is greater than 20”, my chances of success improve from 1/100 to 1/80.
In this way, I can see how R(s) could improve our estimate of y.
But this can only work if I somehow know with certainty that the value of y is bounded between 1 and 100 (I already assumed the output of Q is bounded between 1 and 100).
If that’s not the case, someone telling me R(s) = 20 would only shift my guesses from 1 - 100 to 21 - 120, not improving my chances of hitting 40.
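The picking game above, as a quick simulation (just to spell out the probabilities I’m claiming, using the numbers from my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
target = 40
n = 1_000_000

# Case 1: uniform guess in 1..100
hits_plain = np.sum(rng.integers(1, 101, size=n) == target)

# Case 2: the tip restricts the guess to 21..100 (target known to be > 20)
hits_tip = np.sum(rng.integers(21, 101, size=n) == target)

# Case 3: add 20 to an unrestricted guess in 1..100 -> the range just shifts to 21..120
hits_shift = np.sum(rng.integers(1, 101, size=n) + 20 == target)

print(hits_plain / n)   # ~ 1/100
print(hits_tip / n)     # ~ 1/80
print(hits_shift / n)   # ~ 1/100 again: shifting alone doesn't help
```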
So to recapitulate, I don’t know how a random Q would help.
About R(s), I believe it only helps if y is bounded.
And there is a more difficult question about R(s).
For this to work, we are supposing that R(s) somehow ‘knows’ about the correct y.
In our example, the tipper can only give a good tip if it knows the location of the target.
But in a Markov process, our current state’s reward has no relation to the final return obtained by following the best path; it doesn’t even know the other states exist.
We could get an R(s) of -10 from our current state because we were dealt a bad hand, or started with our rocket upside down, but after taking action a, our y could be +20 if we follow the best path and play well, or land the rocket successfully.
In that example, R(s) would contribute to making our estimated y worse: it would increase the distance between any random Q output and the correct y = 20 by subtracting 10 from it.
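With made-up numbers, the situation I’m describing looks like this (all values are toy numbers I picked for illustration):

```python
# Toy numbers from the example above (all made up):
true_y = 20.0          # the return we'd actually get by playing well from s
r_s = -10.0            # immediate reward in s (bad hand / rocket upside down)

random_q = 5.0         # some arbitrary output of the untrained Q at (s', a')

error_without_r = abs(random_q - true_y)         # |5 - 20|  = 15
error_with_r    = abs(r_s + random_q - true_y)   # |-5 - 20| = 25

print(error_without_r, error_with_r)  # adding R(s) moved the estimate further away
```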
These are some of the ways I tried to understand how Q-learning works, but I failed.
As you can see, I really have no idea, which is why I’m asking for help.
Thanks!