All other numbers seem to check out, so maybe I missed something.
Any clarification is appreciated.
Edit 1: OK, so I realize that I jumped the gun slightly by moving straight from the state-action value function lesson to the State-action value function example lab.
I am now working through the Bellman equation lesson and see that some vital parts of the puzzle were missing from the above equation. I am still finding this rather confusing, though, as it still seems that even with the new equation of:
So I just completed the state-action value quiz and initially got the last question wrong.
This was due to not at first recognizing that the direction of movement would change after the initial action due to the algorithm then “behaving optimally after that.”
I think that possibly the same is happening with my understanding of the lab task. It’s not so much the maths but more about understanding the purpose of the value of num_actions = 2 and at what point directions may change in order to behave optimally so as to get the values displayed in the lab task.
The error was in my misunderstanding of num_actions = 2. Although the lectures explicitly state that the chosen action happens only once, seeing this variable set to 2 made me think that, whichever direction is chosen from whichever starting state, the movement would happen in that direction twice before the algorithm begins behaving optimally.
After calculating the above but only moving once to the right and then moving optimally (to the left) I do indeed get the value of 6.25.
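For anyone who wants to check this numerically, here is a minimal sketch of the "take the action once, then behave optimally" calculation. It assumes the lecture's setup (six states in a row, terminal rewards of 100 on the left and 40 on the right, zero reward elsewhere, discount factor 0.5) and that the 6.25 is for moving right from the third state. The function name q_values and the 0-based indexing are my own and are not taken from the lab code.

```python
def q_values(rewards, gamma, n_sweeps=50):
    """Repeatedly apply Q(s, a) = R(s) + gamma * max_a' Q(s', a') on a 1-D row of states."""
    n = len(rewards)
    # q[s] = [value of going left once, value of going right once],
    # in both cases behaving optimally afterwards.
    q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(n_sweeps):
        # Best achievable return from each state under the current estimates.
        v = [max(row) for row in q]
        for s in range(n):
            if s in (0, n - 1):
                # Terminal states: the episode ends, so only the reward counts.
                q[s] = [float(rewards[s])] * 2
            else:
                q[s] = [rewards[s] + gamma * v[s - 1],   # one step left
                        rewards[s] + gamma * v[s + 1]]   # one step right
    return q

# Assumed setup: rewards 100 and 40 at the two ends, gamma = 0.5.
q = q_values([100, 0, 0, 0, 0, 40], gamma=0.5)
print(q[2])  # third state, [left, right] -> [25.0, 6.25]
```

The 6.25 comes from one step right (to the fourth state), after which the optimal policy turns around and heads left: 0 + 0.5 × 12.5 = 6.25, where 12.5 is the best return from the fourth state.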
I guess the only question that remains, then, is: what does “num_actions = 2” actually mean, and how does it affect the outcome of the algorithm?
We assigned the value 2 to the variable num_actions, but the variable is never used in the rest of the optional lab, so it has no effect on the outcome of the algorithm. You can search for num_actions in the C3 W3 optional lab and find that it appears only once, in num_actions = 2; the variable itself is never used elsewhere.
It means “there are 2 actions”: going left or going right.
However, in the assignment, 4 is assigned to num_actions and it is actually used to correctly build the neural network that predicts the Q-value for each of the 4 actions.
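To illustrate how num_actions shapes the network output, here is a rough sketch in plain Python. It is not the assignment's network: it is a single linear output layer with random weights, and make_output_layer, predict_q, and the state size of 8 are hypothetical choices made only for this illustration. The point is just that the output layer has one unit per action, so the prediction contains num_actions Q-values.

```python
import random

def make_output_layer(state_size, num_actions, seed=0):
    """Build one weight row (and one bias) per action. Hypothetical helper."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(state_size)]
               for _ in range(num_actions)]
    biases = [0.0] * num_actions
    return weights, biases

def predict_q(state, weights, biases):
    """Linear output: one Q-value estimate per action."""
    return [sum(w * x for w, x in zip(row, state)) + b
            for row, b in zip(weights, biases)]

num_actions = 4            # as in the assignment
state = [0.1] * 8          # assumed state size, for illustration only
weights, biases = make_output_layer(len(state), num_actions)
q = predict_q(state, weights, biases)

# The agent would pick the action whose predicted Q-value is largest.
best_action = max(range(num_actions), key=lambda a: q[a])
print(len(q))  # 4
```

Changing num_actions here changes the number of values the layer returns, which is why the variable matters in the assignment but not in the optional lab.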