I wanted to see how the state-action value function handles a high misstep_prob, since most real-world applications have to cope with a lot of error from influential variables that can't be modeled.
However, I did not expect the optimal policy to always favor the lower reward. Here is a fairly extreme example:
terminal_left_reward = 10000
terminal_right_reward = 1
each_step_reward = 0
# Discount factor
gamma = 1
# Probability of going in the wrong direction
misstep_prob = 0.9999
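To look at this outside the lab's plotting utilities, here is a minimal value-iteration sketch of how I understand the setup. The 6-state 1-D gridworld with terminal states at both ends, and the reading of misstep_prob as "the agent moves opposite to the chosen direction", are assumptions on my part, not something stated in the lab code above.

# Minimal value-iteration sketch (my own reconstruction, not the lab's utils code).
# Assumed environment: 6 states in a row, states 0 and 5 are terminal,
# and with probability misstep_prob the agent moves opposite to its intended direction.
import numpy as np

terminal_left_reward = 10000
terminal_right_reward = 1
each_step_reward = 0
gamma = 1
misstep_prob = 0.9999

num_states = 6  # states 0..5; 0 and 5 are terminal (assumed layout)
rewards = np.array(
    [terminal_left_reward]
    + [each_step_reward] * (num_states - 2)
    + [terminal_right_reward],
    dtype=float,
)

# Terminal states keep their reward as their value; interior states start at 0.
V = np.zeros(num_states)
V[0], V[-1] = rewards[0], rewards[-1]

def q_values(V, s):
    # Q(s, left) and Q(s, right): the intended move is flipped with probability misstep_prob
    q_left = rewards[s] + gamma * ((1 - misstep_prob) * V[s - 1] + misstep_prob * V[s + 1])
    q_right = rewards[s] + gamma * ((1 - misstep_prob) * V[s + 1] + misstep_prob * V[s - 1])
    return q_left, q_right

for _ in range(1000):  # far more sweeps than this tiny chain needs to converge
    for s in range(1, num_states - 1):
        V[s] = max(q_values(V, s))

arrows = ["<-" if q_values(V, s)[0] >= q_values(V, s)[1] else "->"
          for s in range(1, num_states - 1)]
print("state values:", np.round(V, 2))
print("intended direction in each interior state:", arrows)

Under those assumptions this prints "->" for every interior state, i.e. the intended direction points at the terminal with the reward of 1, which matches what the lab's visualization shows me.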
Why does the policy always favor the lower reward?
I was able to replicate the same result by swapping the left and right rewards:
terminal_left_reward = 1
terminal_right_reward = 10000
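In the sketch above, this corresponds to swapping just those two variables, rebuilding `rewards`, and repeating the sweeps; the intended directions flip and again point toward the lower reward.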
This doesn’t make sense to me. Can I have some help understanding it?