State-Action value fails to find optimal policy

If we set equal rewards for both terminal states and a misstep probability of 0.5, like this:
terminal_left_reward = 100
terminal_right_reward = 100
each_step_reward = 0

# Discount factor
gamma = 0.8

# Probability of going in the wrong direction
misstep_prob = 0.5

It doesn’t even matter what gamma is: the “optimal” policy always seems to indicate “go left”.
Is this a bug in the code?

Hello @dmokran,

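If you also look at the bottom chart, you will see that going left and going right are equally good. With misstep_prob = 0.5, either action moves you left or right with the same probability, so the two Q-values in each state are identical. The policy shows “go left” only because the code considers going left first when the values are tied.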
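
To make the tie concrete, here is a minimal sketch, not the lab's actual code. It assumes a 6-state line world with terminal states at both ends, and it assumes the action is picked with NumPy's argmax, which returns the first index on ties. The function name q_values and the number of states are hypothetical.

```python
import numpy as np

def q_values(num_states=6, left_reward=100.0, right_reward=100.0,
             step_reward=0.0, gamma=0.8, misstep_prob=0.5, iters=200):
    """Value iteration on a 1-D line world (hypothetical helper).

    Returns Q with one row per state and columns [left, right].
    """
    rewards = np.full(num_states, step_reward)
    rewards[0], rewards[-1] = left_reward, right_reward
    v = np.zeros(num_states)
    q = np.zeros((num_states, 2))
    for _ in range(iters):
        for s in range(num_states):
            if s in (0, num_states - 1):
                q[s] = rewards[s]  # terminal state: collect reward, episode ends
                continue
            v_left, v_right = v[s - 1], v[s + 1]
            # Intended move happens with prob (1 - misstep_prob), the opposite otherwise.
            q[s, 0] = rewards[s] + gamma * ((1 - misstep_prob) * v_left + misstep_prob * v_right)
            q[s, 1] = rewards[s] + gamma * ((1 - misstep_prob) * v_right + misstep_prob * v_left)
        v = q.max(axis=1)  # state value = best action value
    return q

q = q_values()
print(np.array_equal(q[:, 0], q[:, 1]))  # True: left and right are tied in every state
print(np.argmax(q, axis=1))              # all zeros: argmax picks index 0 (left) on ties
```

Changing gamma only rescales both columns together, so the tie (and the “go left” display) stays. Setting misstep_prob below 0.5 breaks the tie, and the states nearer the right end then prefer going right.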

Raymond

Understood.
Thank you.