If we set equal rewards for both terminal states and a 50% misstep probability, like this:
terminal_left_reward = 100
terminal_right_reward = 100
each_step_reward = 0
# Discount factor
gamma = 0.8
# Probability of going in the wrong direction
misstep_prob = 0.5
No matter what gamma is, the “optimal” policy always seems to be “go left”.
Is this a bug in the code?
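
One possible explanation, which can be checked with a small sketch: if a "misstep" means moving in the opposite direction, then with misstep_prob = 0.5 both actions lead to exactly the same distribution over next states. Their Q-values would then be identical in every state, and the reported policy would depend only on how ties are broken. The code below is not the original code. The 6-state line world, the helper names (`reward`, `q_value`, `value_iteration`) and the `>=` tie-break are all assumptions made for illustration.

```python
# Hypothetical 1-D world: states 0..5, where 0 and 5 are terminal.
# Only the five parameters below come from the post; everything else is assumed.
terminal_left_reward = 100
terminal_right_reward = 100
each_step_reward = 0
gamma = 0.8
misstep_prob = 0.5

NUM_STATES = 6
LEFT, RIGHT = 0, 1


def reward(state):
    """Reward received in a state (terminal states carry the large rewards)."""
    if state == 0:
        return terminal_left_reward
    if state == NUM_STATES - 1:
        return terminal_right_reward
    return each_step_reward


def q_value(values, state, action):
    """Expected return of taking `action` in a non-terminal `state`."""
    intended = state - 1 if action == LEFT else state + 1
    other = state + 1 if action == LEFT else state - 1
    expected_next = (1 - misstep_prob) * values[intended] + misstep_prob * values[other]
    return reward(state) + gamma * expected_next


def value_iteration(sweeps=200):
    """Run a fixed number of synchronous value-iteration sweeps."""
    values = [reward(s) for s in range(NUM_STATES)]  # terminal values stay fixed
    for _ in range(sweeps):
        new_values = list(values)
        for s in range(1, NUM_STATES - 1):
            new_values[s] = max(q_value(values, s, LEFT), q_value(values, s, RIGHT))
        values = new_values
    return values


values = value_iteration()
q_table = [
    (q_value(values, s, LEFT), q_value(values, s, RIGHT))
    for s in range(1, NUM_STATES - 1)
]

# Assumed tie-break: prefer LEFT when the two Q-values are equal.
policy = ["left" if q_left >= q_right else "right" for q_left, q_right in q_table]
ties = [q_left == q_right for q_left, q_right in q_table]

print("policy:", policy)
print("Q(left) == Q(right) in every state:", all(ties))
```

If the ties check comes out true for every state, then "go left" is just as optimal as "go right", and the output would reflect the tie-breaking rule rather than a bug. Changing gamma in this sketch rescales the values but should not break the tie.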
