I just finished the video, but I am still confused by the state-action value the professor demonstrated.
I am wondering: if the agent's current direction is right, should we change the Q(2, right) calculation to the following, since the reward on the right side is 40?
Q(2, right) = 0 + 0*0.5^1 + 0*0.5^2 + 0*0.5^3 + 40*0.5^4
Hello @James_Yu1,
The lecture's original calculation is the only answer, because we are bound by this requirement:
To behave optimally after moving to grid 3, we will have to keep moving left all the way to grid 1.
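For reference (assuming I am remembering the slide's numbers correctly, with a reward of 100 at grid 1), that calculation works out to:
Q(2, right) = 0 + 0*0.5 + 0*0.5^2 + 100*0.5^3 = 12.5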
Cheers,
Raymond
@James_Yu1 However, if you really choose to keep moving right, you will eventually reach grid 6 and be rewarded 40 points. Even so, the definition of Q does not change: Q(2, right) is always 12.5.
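To make that concrete, here is a minimal sketch (not from the course materials) that computes the discounted return of both trajectories, assuming the lecture's setup: discount factor 0.5, reward 100 at grid 1, 40 at grid 6, and 0 elsewhere.

```python
gamma = 0.5

def discounted_return(rewards, gamma):
    # Sum of rewards[k] * gamma**k along one trajectory.
    return sum(r * gamma**k for k, r in enumerate(rewards))

# Move right to grid 3, then behave optimally (turn around, left to grid 1):
# rewards collected at grids 2, 3, 2, 1.
print(discounted_return([0, 0, 0, 100], gamma))  # 12.5 -> this is Q(2, right)

# Keep moving right all the way to grid 6:
# rewards collected at grids 2, 3, 4, 5, 6.
print(discounted_return([0, 0, 0, 0, 40], gamma))  # 2.5 -> smaller, so not the optimal continuation
```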
Hi @rmwkwok,
How do we know how to behave optimally?
To behave optimally after moving to grid 3, we will have to keep moving left all the way to grid 1.
Isn’t it that we need to calculate the return of every possible move from every state, and then find the maximum value, so that we know whether, after moving to grid 3, we should keep moving right or move back to the left?
Hello, @ansonchantf,
You have answered your own question! Generally, we first need to calculate all the state-action Q values before we know the absolute best move. This slide does not show those steps, but since this is a very simple example, we can tell what the best action should be just by inspection.
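For completeness, here is a minimal sketch (not from the course materials) of those steps, assuming the usual setup of this example: six grids, terminal rewards of 100 at grid 1 and 40 at grid 6, zero reward elsewhere, deterministic left/right moves, and a discount factor of 0.5.

```python
gamma = 0.5
rewards = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
terminal = {1, 6}
actions = ("left", "right")

def step(state, action):
    # Deterministic move: "left" decreases the grid index, "right" increases it.
    return state - 1 if action == "left" else state + 1

# Value iteration: V[s] starts at the terminal rewards and is refined
# until it settles (a handful of sweeps is plenty for six grids).
V = {s: (rewards[s] if s in terminal else 0.0) for s in rewards}
for _ in range(20):
    for s in rewards:
        if s not in terminal:
            V[s] = max(rewards[s] + gamma * V[step(s, a)] for a in actions)

# Q(s, a) = R(s) + gamma * V(next state), for every non-terminal state.
for s in sorted(rewards):
    if s in terminal:
        continue
    for a in actions:
        print(f"Q({s}, {a}) = {rewards[s] + gamma * V[step(s, a)]}")
# Prints Q(2, left) = 50.0 and Q(2, right) = 12.5, so the best move in grid 2 is left.
```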
Cheers,
Raymond