Just looking for a little clarification on the example and the number that Q(s,a) produces depending on whether we are moving left or right.
I can’t seem to figure out where 6.25 comes from for state 3 moving right; given that the terminal state reward on the right is 40, I calculate that this number should be 5.0.
Maybe I have misunderstood the math behind this, but here is how I calculated it from this state: moving right for 2 actions, then behaving optimally after those, with gamma = 0.5 and misstep_prob = 0.
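Spelling out the arithmetic of that (mistaken) calculation, i.e. two steps right from state 3 and then continuing right to the 40-reward terminal state:

$$R(3) + \gamma R(4) + \gamma^2 R(5) + \gamma^3 R(6) = 0 + 0.5 \cdot 0 + 0.25 \cdot 0 + 0.125 \cdot 40 = 5.0$$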
All other numbers seem to check out, so maybe I missed something.
Any clarification is appreciated.
Thanks.
Edit 1: Ok, so I realize that I jumped the gun slightly, as I moved straight from the state-action value function lesson to the state-action value function example lab.
I am now working through the Bellman equation lesson and can see that some vital parts of the puzzle were missing from my calculation above. I am still finding this rather confusing, though, as I still seem to get the same answer even with the new equation:
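$$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$$

where R(s) is the reward at the current state, γ is the discount factor, and s′ is the state reached after taking action a.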
In the state-action value function lesson, at timestamp 1:10, we can see the setup as described above, but the lab example from the video shows 6.25 for state 3, moving right.
This is the value that I calculate to be 5.0, so I think I must have done something wrong.
So I just completed the state-action value quiz and initially got the last question wrong.
This was because I did not at first recognize that the direction of movement would change after the initial action, since the algorithm then “behaves optimally after that.”
I think the same thing may be happening with my understanding of the lab task. It’s not so much the maths as understanding the purpose of num_actions = 2 and at what point the direction may change in order to behave optimally, so as to get the values displayed in the lab task.
The error was in my misunderstanding of num_actions = 2. Although the lectures explicitly state that the action happens only once, because this variable is given the value 2 I thought it meant that, whichever direction is chosen from whichever starting state, the rover would move in that direction twice and only then begin behaving optimally.
After recalculating the above, but moving only once to the right and then moving optimally (to the left), I do indeed get the value of 6.25.
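Written out, taking one step right from state 3 and then behaving optimally (heading back left towards the 100 reward at the left terminal state):

$$Q(3, \rightarrow) = 0 + 0.5 \cdot 0 + 0.25 \cdot 0 + 0.125 \cdot 0 + 0.0625 \cdot 100 = 6.25$$

or equivalently, in one step of the Bellman equation, Q(3, →) = R(3) + 0.5 · max_a′ Q(4, a′) = 0 + 0.5 · 12.5 = 6.25, where 12.5 is the lab’s value for Q(4, ←).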
I guess the only question that remains, then, is: what does num_actions = 2 actually mean, and how does it affect the outcome of the algorithm?
We assigned the value 2 to the variable num_actions, but the variable is never used in the rest of the optional lab, so it has no effect on the outcome of the algorithm. You can search for num_actions in the C3 W3 optional lab and find that it only appears once, in num_actions = 2; the variable itself is never used anywhere else.
It means “there are 2 actions”: going left or going right.
However, in the assignment, num_actions is set to 4, and it is actually used to build the neural network that predicts the Q-value for each of the 4 actions.
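For anyone curious how that looks in practice, here is a minimal sketch of the idea: the size of the network’s output layer comes from num_actions, so one forward pass produces a Q-value for every action. The hidden-layer sizes and the state dimension below are placeholder assumptions, not necessarily the assignment’s exact architecture.

```python
import tensorflow as tf

num_actions = 4   # the assignment's environment has 4 discrete actions
state_size = 8    # assumed dimension of the state vector (placeholder)

# The output layer has num_actions units, so the network predicts
# one Q(s, a) value for every possible action at once.
q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(state_size,)),
    tf.keras.layers.Dense(64, activation='relu'),   # placeholder hidden layers
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(num_actions, activation='linear'),
])

# Acting greedily: evaluate the Q-values for a state and pick the argmax.
state = tf.random.uniform((1, state_size))
q_values = q_network(state)                 # shape: (1, num_actions)
best_action = int(tf.argmax(q_values, axis=-1)[0])
print(q_values.numpy(), best_action)
```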
I understand now that num_actions = 2 simply reflects the fact that the Mars Rover discrete example has only 2 possible actions, namely left or right.
Not sure why all this didn’t fall into place sooner.