State-action value function example question


Does the Bellman equation still apply if I set the left reward to zero? I would have expected the values encircled in the generated graph to be zero in this case.

I believe you are setting misstep_prob to 0.4, which means that even if we decide to go LEFT, there is still a chance that the robot will go RIGHT and eventually collect some reward. This is why you see those encircled positive values.


No, that was just me playing with misstep_prob for another reason. However, even with misstep_prob set to zero, I still see values for the left direction. Below is an updated screenshot. Thanks!

Perhaps try re-running all the cells after setting misstep_prob to 0?

I ran the tests myself with misstep_prob set to 0 and to 0.4. I can reproduce your result with 0.4. As for misstep_prob = 0, the LEFT value in state s is always half the RIGHT value in state s-1, because every time the robot takes the LEFT action in state s and reaches s-1, it then turns around and goes RIGHT to achieve the biggest value. The value is "halved" because of gamma (0.5 here). The robot turns RIGHT because Q(s, a) is defined as the return from taking action a once in state s and behaving optimally afterwards, i.e. Q(s, a) = R(s) + gamma * max over a' of Q(s', a').
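To make the "halved" relationship concrete, here is a minimal sketch that recomputes the Q values by hand for the misstep_prob = 0 case. It is not the lab's code: the function name `q_values`, the six-state layout, the right terminal reward of 40, and gamma = 0.5 are my assumptions for illustration.

```python
def q_values(rewards, gamma, sweeps=50):
    """Q(s, LEFT) and Q(s, RIGHT) on a 1-D chain with deterministic moves.

    Assumptions (hypothetical, not taken from the lab code): the first and
    last states are terminal, and misstep_prob is 0, so every action moves
    the robot exactly one state in the chosen direction.
    """
    n = len(rewards)
    v = [0.0] * n
    v[0], v[-1] = rewards[0], rewards[-1]  # terminal value = terminal reward
    q_left, q_right = list(v), list(v)
    for _ in range(sweeps):
        for s in range(1, n - 1):
            # Bellman equation: Q(s, a) = R(s) + gamma * max_a' Q(s', a')
            q_left[s] = rewards[s] + gamma * v[s - 1]
            q_right[s] = rewards[s] + gamma * v[s + 1]
        for s in range(1, n - 1):
            v[s] = max(q_left[s], q_right[s])
    return q_left, q_right


# Left terminal reward 0, right terminal reward 40, no per-step reward.
q_left, q_right = q_values([0, 0, 0, 0, 0, 40], gamma=0.5)
for s in range(1, 5):
    print(s, q_left[s], q_right[s])
```

In this setup only the state right next to the left terminal gets a LEFT value of zero. Every other LEFT value is gamma times the RIGHT value of the state to its left, so it stays positive even though the left reward itself is zero.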

My screenshots below:

misstep_prob = 0

misstep_prob = 0.4


But in your solution, I see that some LEFT values aren’t zero with misstep_prob = 0. Please see encircled values below. Thanks!

Thank you. I modified my answer.