State Action Value Function misstep_prob = 0.9999 favors low rewards

I wanted to see how the State Action Value Function handles a high misstep_prob, because it seems to me that most real-world applications have to deal with many errors, since not all of the influential variables can be modeled.

However, I did not expect that the optimal policy would always favor the lower reward. Here is a pretty extreme example:

terminal_left_reward = 10000
terminal_right_reward = 1
each_step_reward = 0

# Discount factor
gamma = 1

# Probability of going in the wrong direction
misstep_prob = 0.9999

Why does the policy always favor the lower reward?


I was able to replicate the same result by swapping the left and right rewards:

terminal_left_reward = 1
terminal_right_reward = 10000


This doesn’t make sense to me. Can I have some help understanding it?

Hello Michael @mosofsky ,

A misstep_prob of 1 means the agent must go in the wrong direction: if we decide to go left, it goes right with 100% probability.

For your case of misstep_prob=0.9999, it almost always goes in the wrong direction.

Now look at your case where terminal_left_reward = 10000: the optimal policy favors the action of moving to the right, so that the agent almost always ends up moving to the left as the result of a misstep.
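You can verify this reversal with a small value-iteration sketch. Note this is my own minimal reconstruction, not the lab's actual code: the 6-state layout with terminals at both ends, the landing_value helper, and the iteration count are all assumptions.

```python
import numpy as np

n_states = 6                      # assumed layout: states 0..5, with 0 and 5 terminal
terminal_left_reward = 10000.0
terminal_right_reward = 1.0
each_step_reward = 0.0
gamma = 1.0
misstep_prob = 0.9999

rewards = np.array([terminal_left_reward]
                   + [each_step_reward] * (n_states - 2)
                   + [terminal_right_reward])

def landing_value(sp, Q):
    """Reward collected on entering state sp, plus future value if non-terminal."""
    if sp == 0 or sp == n_states - 1:
        return rewards[sp]                       # terminal: episode ends here
    return rewards[sp] + gamma * Q[sp].max()

Q = np.zeros((n_states, 2))                      # column 0 = "try left", 1 = "try right"
for _ in range(1000):                            # synchronous value-iteration sweeps
    Q_new = Q.copy()
    for s in range(1, n_states - 1):
        v_left = landing_value(s - 1, Q)
        v_right = landing_value(s + 1, Q)
        # a misstep sends the agent the opposite way with probability misstep_prob
        Q_new[s, 0] = (1 - misstep_prob) * v_left + misstep_prob * v_right
        Q_new[s, 1] = (1 - misstep_prob) * v_right + misstep_prob * v_left
    Q = Q_new

policy = ["left" if Q[s, 0] > Q[s, 1] else "right" for s in range(1, n_states - 1)]
print(policy)   # every interior state aims right, so missteps carry it left
```

With the big reward on the left and misstep_prob near 1, the greedy policy comes out as "right" in every interior state, which is exactly the reversal described above.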

If you want to see the most randomized case, set misstep_prob = 0.5. Then, at each state, the optimal Q-values for moving left and moving right are the same.
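Here is a quick check of that symmetry, using the same assumed 6-state layout as above (the layout and helper are mine, not the lab's): with misstep_prob = 0.5, both actions induce the identical next-state distribution, so the two Q columns must coincide.

```python
import numpy as np

# Assumed 6-state layout: reward 10000 at state 0, reward 1 at state 5.
n_states, gamma, misstep_prob = 6, 1.0, 0.5
rewards = np.array([10000.0, 0, 0, 0, 0, 1.0])

Q = np.zeros((n_states, 2))          # column 0 = "try left", 1 = "try right"
for _ in range(1000):
    Q_new = Q.copy()
    for s in range(1, n_states - 1):
        def v(sp):                   # value of landing in state sp
            if sp in (0, n_states - 1):
                return rewards[sp]
            return rewards[sp] + gamma * Q[sp].max()
        Q_new[s, 0] = (1 - misstep_prob) * v(s - 1) + misstep_prob * v(s + 1)
        Q_new[s, 1] = (1 - misstep_prob) * v(s + 1) + misstep_prob * v(s - 1)
    Q = Q_new

# At misstep_prob = 0.5 the two update rules are literally the same expression,
# so the left and right Q-values agree at every interior state.
print(Q[1:-1, 0], Q[1:-1, 1])
```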
