I was playing around with the State-action value function example lab and I noticed something interesting:

Misstep_prob = 0: According to the optimal policy and Q(s,a), the rover moves right from state 5 and moves left from state 4 and below.

Misstep_prob = 0.1: Here the behaviour is the same as above but the values of Q(s,a) are slightly lower.

Misstep_prob = 0.5: Here something interesting happens. A misstep probability of 0.5 means that every step is a coin toss: the rover moves in the chosen direction only half the time, so the result of an action is effectively random. The Jupyter notebook takes a while to calculate this scenario. Once the results are in, Q(s,a) for moving right and for moving left is equal in every state, and yet the optimal policy seems to always be to move left towards state 1 for the 100 reward. Why is that? I would have imagined that, if the returns from moving right and left are equal, the rover would have no basis for choosing which state to move to next.
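To poke at this outside the lab, here is a minimal sketch of how I think these Q(s,a) values could be reproduced with plain value iteration. It is not the lab's code: the names `q_values` and `greedy_policy` are made up, and the reward of 40 in state 6, the zero reward in the middle states, the discount of 0.5 and the "left is column 0" ordering are all assumptions on my part (only the 100 reward in state 1 comes from what I described above).

```python
LEFT, RIGHT = 0, 1  # column order is an assumption of this sketch


def q_values(misstep_prob, rewards=(100, 0, 0, 0, 0, 40), gamma=0.5, sweeps=200):
    """Value iteration for a 1-D rover with terminal states at both ends.

    rewards and gamma are assumed values, not read from the lab.
    """
    n = len(rewards)
    p = misstep_prob
    V = [0.0] * n
    Q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(sweeps):
        for s in range(n):
            if s in (0, n - 1):
                # Terminal state: the episode ends, both "actions" just collect the reward.
                Q[s] = [float(rewards[s])] * 2
                continue
            v_left, v_right = V[s - 1], V[s + 1]
            # The commanded move happens with probability 1 - p, the opposite one with p.
            Q[s] = [
                rewards[s] + gamma * ((1 - p) * v_left + p * v_right),
                rewards[s] + gamma * ((1 - p) * v_right + p * v_left),
            ]
        V = [max(q) for q in Q]
    return Q


def greedy_policy(Q):
    # list.index(max(...)) returns the first maximum, so a tie resolves to LEFT.
    return [q.index(max(q)) for q in Q]


for p in (0.0, 0.1, 0.5, 0.7, 1.0):
    Q = q_values(p)
    # Show only the non-terminal states 2..5.
    print(p, [[round(x, 2) for x in q] for q in Q[1:-1]], greedy_policy(Q)[1:-1])
```

In this sketch the 0.5 case gives identical left and right values in every state, and the policy comes out as "left" purely because the tie is broken in favour of the first column. NumPy's `argmax` also returns the first occurrence on ties, so if the lab derives its arrows that way (I have not checked), that alone could account for the left arrows.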

Misstep_prob = 0.7: Here the optimal policy has actually become to move towards the midpoint between state 1 and state 6. Starting from state 5, the rover would move left until it reached state 3, at which point it would move right to state 4, and it would keep bouncing between states 3 and 4. Why?

Misstep_prob = 1: Same behaviour as above, but now the rover would be bouncing between states 5 and 4.
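For reference, this is how I understand the misstep to enter the calculation. The notation is my own, not taken from the lab: $p$ is the misstep probability, $R(s)$ the reward in state $s$, $\gamma$ the discount factor, and $V(s) = \max_a Q(s,a)$.

$$
\begin{aligned}
Q(s, \text{left}) &= R(s) + \gamma \big[ (1-p)\, V(s-1) + p\, V(s+1) \big] \\
Q(s, \text{right}) &= R(s) + \gamma \big[ (1-p)\, V(s+1) + p\, V(s-1) \big]
\end{aligned}
$$

If that reading is right, replacing $p$ with $1-p$ simply swaps the two expressions, and $p = 0.5$ makes them identical. For $p > 0.5$ the arrows would then show the commanded direction while the rover mostly moves the opposite way, so arrows pointing at each other might not mean the rover actually bounces between those two states.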