I think there is an error or is my understanding off


how does Q(2, right) give 12.5, isn’t it suppose to be (0.5)^3 * 40?

Which video is it?

Raymond

Course 3 week 3, state action value function definition

OK! Then I guess you must be talking about this slide:

So the answer is, after you turn right, the OPTIMAL path is then to turn left, left, and left to reach the 100 reward points.

Raymond

ok, so Q always results in the optimal path but just that it has to take a step first to know which is the optimal path after the step correct?

based on this image if I take Q of state 5 to the left I end up with 6.25 which is less optimal using the bellman eqn, so will I still move to the left and continue to do so since the next state, state prime, is 4 and the Q max values traverse to the left till 100 at the state of 4. or how does it use the formula to go from state 4 to then go back to a state 5 where it would have originally been more optimal to traverse right instead?

if possible do u mind showing me some math eqns to explain if it’s too tricky then just an explanation without a diagram would work too. Thanks a bunch

1 Like

image

There are 2 keys here:

  1. Q(s, a) answers us what the Q is if I am in state s and I take action a. a is a variable of our choice. a is the action we CHOOSE to take. a is chosen regardless it is optimal or not. In simple words, I can choose my first action to be not optimal. But after the first action, the rest of the actions have to be optimal.

  2. Follow the three bullet points strictly, because they define Q(s, a).

Raymond

i see so we take action 1 once then behave optimally, so to relate to my example given above, it would go from state 5 to 4 then back to 1. is the correct to say so since it behaves optimally after the 5 to 4?

Yes, after moving from 5 to 4, we start to behave optimally, and to behave optimally, we have to keep going to the left.

@zheng_xiang1

Such definition makes a lot of sense right? My Q(s, a) tells me the Q value if I went left, and the Q value if I went right. Given both values, I can choose the best action to take at state s, right? Without these two pieces of information in the first place, how could I decide which way to go?

1 Like

yup, tks ray!! for the explanation again