Hi @karko2tr,
Besides @canxkoz’s excellent explanation, let’s look at a specific example.
Before we start, let’s restate your setup and make an assumption on top of it. A training sample consists of a state, an action, a reward, and a next state, or (S, A, R, S’) in symbols.
- y_target for (S, A) = R + gamma * max_A’ Q(S’, A’), as you described, or R + rand1
- y_predict for (S, A) = Q(S, A) = rand2
- The assumption is that, after training, y_predict on the same training sample (S, A, R, S’) will equal R + rand1 (a small sketch follows this list). OK?
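Just to make the target construction concrete, here is a minimal Python sketch. The `q_network` callable and its interface are hypothetical placeholders, not the assignment’s actual code:

```python
import numpy as np

gamma = 0.99  # discount factor; just an example value

def compute_target(q_network, sample):
    """y_target = R + gamma * max_A' Q(S', A') for one (S, A, R, S') sample."""
    S, A, R, S_next = sample
    q_next = q_network(S_next)         # hypothetical: Q-values over all actions at S'
    return R + gamma * np.max(q_next)  # bootstraps on the current (possibly random) network

def compute_prediction(q_network, sample):
    """y_predict = Q(S, A), the network's current estimate for the action actually taken."""
    S, A, R, S_next = sample
    return q_network(S)[A]
```

Training nudges y_predict toward y_target, which is exactly what the assumption above relies on.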
Now comes the specific example.
Let’s say we obtained a series of samples through a simulated robot:
S1, A1, R1, S2, A2, R2, S3, A3, R3, S4, A4, R4, S5.
Here, we have 4 samples (a slicing sketch follows this list):
- S1, A1, R1, S2
- S2, A2, R2, S3
- S3, A3, R3, S4
- S4, A4, R4, S5
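In code, turning that trajectory into the 4 samples could look something like this (a sketch; the variable names are made up):

```python
# trajectory = [S1, A1, R1, S2, A2, R2, S3, A3, R3, S4, A4, R4, S5]
def to_samples(trajectory):
    """Slice a flat trajectory [S1, A1, R1, S2, ...] into (S, A, R, S') samples."""
    samples = []
    for i in range(0, len(trajectory) - 3, 3):
        samples.append(tuple(trajectory[i:i + 4]))  # (S, A, R, S')
    return samples
```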
By our assumption, after training, the following predictions are more realistic:
- y_predict for (S1, A1) = Q(S1, A1) = R1 + gamma * Q(S2, A2) (assume A2 is still the best action)
- y_predict for (S2, A2) = Q(S2, A2) = R2 + gamma * Q(S3, A3) (assume A3 is still the best action)
- y_predict for (S3, A3) = Q(S3, A3) = R3 + gamma * Q(S4, A4) (assume A4 is still the best action)
- y_predict for (S4, A4) = Q(S4, A4) = R4 + gamma * max_A’ Q(S5, A’) = R4 + gamma * rand
Now, what do these equations tell us? If we substitute each equation into the one above it, we get
Q(S1, A1) = R1 + gamma * (R2 + gamma * (R3 + gamma * (R4 + gamma * rand))), or,
Q(S1, A1) = R1 + (gamma * R2) + (gamma^2 * R3) + (gamma^3 * R4) + (gamma^4 * rand)
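To check that the nested and the expanded forms agree, and to see the weight carried by the random value, here is a tiny numeric sketch (the rewards, gamma, and the random bootstrap value are invented purely for illustration):

```python
gamma = 0.99              # example discount factor
R = [1.0, 0.5, 2.0, 1.5]  # made-up rewards R1..R4
rand = 7.3                # the arbitrary Q-value bootstrapped at S5

# Nested form: R1 + gamma * (R2 + gamma * (R3 + gamma * (R4 + gamma * rand)))
nested = rand
for r in reversed(R):
    nested = r + gamma * nested

# Expanded form: R1 + gamma*R2 + gamma^2*R3 + gamma^3*R4 + gamma^4*rand
expanded = sum(gamma**i * r for i, r in enumerate(R)) + gamma**4 * rand

print(nested, expanded)  # the two forms agree (up to floating-point rounding)
print(gamma**4)          # weight on the single random term; for a chain of n steps it is gamma**n
```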
The Take-away
If all of our assumptions hold, our Q-network can now produce Q(S1, A1) as the sum of 4 concrete values and one rather random value. In particular, that single random value is heavily discounted by gamma^4, which makes it quite INSIGNIFICANT. In other words, our Q-network now produces a prediction that contains only an insignificant portion of randomness.
Of course, you can argue against my assumptions. For example, I said,
assume A2 is still the best action
and you might ask: what if A2 is no longer the best action? Then I would say that even if some other action A2’ were the actual best action, it only means we would need another sample (S2, A2’, R2’, S3’) to train the Q-network to get a good Q(S2, A2’). Therefore, the more samples we have for training the Q-network, the stronger my assumption becomes.
You might also question another assumption of mine:
after training, y_predict on the same training sample (S, A, R, S’) will equal R + rand1
This is a matter of whether we have a good architecture and training algorithm for our Q-network.
Look, above I am only providing a way to rationalize why it can work, not why it must work (we can keep attacking the assumptions). I also suggest that you finish the week’s assignment to see for yourself that it can work.
Cheers,
Raymond