I am quite confused by the State-action value function definition video at 5:04, specifically the calculation of Q(s,a).
From my understanding, Q(2, →) should be 0 + (0.5)(0) + (0.5^2)(0) + (0.5^3)(40). Why would it be 0 + (0.5)(0) + (0.5^2)(0) + (0.5^3)(100)?
If you evaluate Q(2,→):
You are asking, “If I start in state 2 and take action →, what discounted return do I get?”
Following that path, you end up in the terminal state with 100.
If you evaluate Q(2,←):
You’d end up in the terminal state with 100 after a single step, so the return is 0 + (0.5)(100) = 50.
The slide in my question is here. I am confused about the calculation for Q(2, →): since we are moving to the right, shouldn’t we be moving toward 40? I don’t get how we can reach 100 with three zeros. I understand Q(2, ←) [1 zero at state 2 and 0.5 × 100 at state 1] and Q(4, ←) [1 zero at state 4 + 1 zero at state 3 + 1 zero at state 2 + 100 at state 1].
So for Q(2, →), shouldn’t we look at state 2, state 3, state 4, state 5, and state 6?
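Therefore, Q(2, →) does not mean we keep moving right all the way. The “→” only tells us to move right once; after that, the rover behaves optimally, which is why it turns around and moves left to the “100”. The three zeros are the rewards at state 2, state 3, and state 2 again, followed by 100 at state 1: 0 + (0.5)(0) + (0.5^2)(0) + (0.5^3)(100) = 12.5.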
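
If it helps to check the numbers, here is a minimal Python sketch of the "take the action once, then behave optimally" definition. It assumes the setup discussed in this thread: six states in a row, a terminal reward of 100 at state 1 and 40 at state 6, zero reward elsewhere, and a discount factor of 0.5. The names `optimal_returns` and `q_value` are made up for illustration and are not from the course code.

```python
GAMMA = 0.5                                   # discount factor assumed from the thread
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}                             # the episode ends in these states
LEFT, RIGHT = -1, +1


def optimal_returns(n_sweeps=50):
    """Best achievable discounted return from each state (simple value iteration)."""
    v = {s: float(REWARDS[s]) if s in TERMINAL else 0.0 for s in REWARDS}
    for _ in range(n_sweeps):
        for s in REWARDS:
            if s not in TERMINAL:
                # Reward now, plus the discounted best return of the better neighbor.
                v[s] = REWARDS[s] + GAMMA * max(v[s + LEFT], v[s + RIGHT])
    return v


def q_value(state, action):
    """Return for taking `action` once in `state`, then behaving optimally."""
    v = optimal_returns()
    return REWARDS[state] + GAMMA * v[state + action]


print(q_value(2, RIGHT))  # right once to state 3, then back left toward 100
print(q_value(2, LEFT))   # straight to state 1
```

Under these assumptions, `q_value(2, RIGHT)` uses the optimal return of state 3, which comes from heading back left, so the 40 at state 6 never enters the calculation.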