For the action pointing to the right at state 2, the slide says 50, 12.5, but it should have been 50, 3.125
because it's 0.5^4 * 40, but I believe the professor accidentally used 100 instead of 40.
After the first action you have to follow the optimal path (i.e. the arrows). So from state 2 you move right to state 3, but then the optimal direction takes you back to state 2 and then to state 1.
Therefore, it would be:
$$0 \times 0.5 + 0 \times 0.5^2 + 100 \times 0.5^3 = 12.5$$
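To double-check the arithmetic, here is a minimal Python sketch of the same sum. The helper name `discounted_return` is made up for illustration, and the reward list assumes what this thread describes: a reward of 0 in states 2 and 3, a reward of 100 in state 1, and a discount factor of 0.5. The first entry is the reward in the starting state 2, which is 0 and so does not change the total.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * reward_t over the visited states, with t starting at 0."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Go right from state 2 first, then follow the arrows: 2 -> 3 -> 2 -> 1.
# Rewards collected in those states (assumed from the thread): 0, 0, 0, 100.
path_rewards = [0, 0, 0, 100]
q_state2_right = discounted_return(path_rewards, gamma=0.5)
print(q_state2_right)  # 12.5
```

The same helper gives 50 for going left from state 2 (`discounted_return([0, 100], 0.5)`), which matches the other number on the slide.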
I think @Mujassim_Jamal meant to say finally ending up in state 1 (instead of 3).
Oh sorry, I didn’t notice that. Edited my response. Thanks @rmwkwok for pointing that out!
Why does taking the action to the right at state 1 give you 100? It must be 100 + 0 * 0.5 + 100 * 0.5^2 = 125. I don’t understand.
The slide in the first post of this thread does not say that taking the right action at state 1 has a Q-value of 100. It is simply saying that the reward for arriving at state 1 is 100. Note the difference between a Q-value and a reward. It matters here because the formula in your question is one that calculates a Q-value.
In fact, states 1 and 6 are terminal states where we won’t take any more actions. In other words, we won’t consider the case that goes from state 1 to state 2 and then back to state 1. Once we are in state 1, that’s the end: no more actions and no going to state 2.
Cheers,
Raymond
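For anyone who wants to see the terminal-state point in numbers, below is a rough Python sketch that recomputes the Q-values for the six-state example. It is only an illustration under assumptions taken from this thread: rewards of 100 in state 1, 40 in state 6 and 0 elsewhere, a discount factor of 0.5, and deterministic left/right moves. The names `REWARDS`, `TERMINAL`, `next_state` and `compute_q` are made up here. At a terminal state the sketch simply stores the reward, because the episode ends there and no further action is taken.

```python
GAMMA = 0.5
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}  # assumed from the thread
TERMINAL = {1, 6}

def next_state(state, action):
    """Deterministic move: 'left' lowers the state number, 'right' raises it."""
    return state - 1 if action == "left" else state + 1

def compute_q(n_sweeps=20):
    """Repeatedly apply Q(s, a) = R(s) + gamma * max_a' Q(s', a')."""
    q = {s: {"left": 0.0, "right": 0.0} for s in REWARDS}
    for _ in range(n_sweeps):
        for s in REWARDS:
            for a in ("left", "right"):
                if s in TERMINAL:
                    # Episode ends here: only the reward, nothing discounted after it.
                    q[s][a] = float(REWARDS[s])
                else:
                    s2 = next_state(s, a)
                    q[s][a] = REWARDS[s] + GAMMA * max(q[s2].values())
    return q

q = compute_q()
for s in sorted(q):
    print(s, q[s])
```

Under these assumptions state 2 comes out as 50 (left) and 12.5 (right), and state 1 stays at 100 rather than 125, since nothing is added after a terminal state.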