State-Action value fails to find optimal policy

dmokran · January 29, 2023, 4:56am

If we set equal rewards for both terminal states and a misstep probability of 0.5, like this:
terminal_left_reward = 100
terminal_right_reward = 100
each_step_reward = 0

# Discount factor
gamma = 0.8

# Probability of going in the wrong direction
misstep_prob = 0.5

No matter what gamma is, the “optimal” policy always seems to indicate “go left”.
Is this a bug in the code?

rmwkwok · January 29, 2023, 5:04am

Hello @dmokran,

If you also look at the bottom chart, you will see that going left and going right are equally good. It is just the code’s behavior that going left is considered first, so a tie shows up as “go left”.
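
To see why the two actions tie, here is a minimal sketch in plain Python. It is not the lab's actual code: the function names `q_values` and `greedy_policy`, the 6-state line world, and the tie-breaking rule (left wins when the values are equal) are all assumptions made for illustration. With misstep_prob = 0.5, the chosen action has no effect on where you end up, so Q(s, left) and Q(s, right) come out identical for any gamma.

```python
def q_values(num_states=6, terminal_left_reward=100, terminal_right_reward=100,
             each_step_reward=0, gamma=0.8, misstep_prob=0.5, sweeps=500):
    """Value iteration on a hypothetical 1-D line world with terminal states at both ends.

    Returns a list of [Q(s, left), Q(s, right)] pairs, one per state.
    """
    rewards = [float(each_step_reward)] * num_states
    rewards[0] = float(terminal_left_reward)
    rewards[-1] = float(terminal_right_reward)

    # V(s) for terminal states is just their reward and never changes.
    v = list(rewards)
    q = [[r, r] for r in rewards]

    for _ in range(sweeps):
        for s in range(1, num_states - 1):
            v_left, v_right = v[s - 1], v[s + 1]
            # Intended move succeeds with probability (1 - misstep_prob).
            q_left = rewards[s] + gamma * ((1 - misstep_prob) * v_left + misstep_prob * v_right)
            q_right = rewards[s] + gamma * (misstep_prob * v_left + (1 - misstep_prob) * v_right)
            q[s] = [q_left, q_right]
            v[s] = max(q_left, q_right)
    return q


def greedy_policy(q):
    # Assumed tie-breaking: "left" is checked first, so equal values yield "left".
    return ["left" if q_left >= q_right else "right" for q_left, q_right in q]


if __name__ == "__main__":
    q = q_values()
    for s, (q_left, q_right) in enumerate(q):
        print(s, q_left, q_right)
    print(greedy_policy(q))
```

Under these assumptions every state reports “left”, not because left is better, but because the two values are equal and left is checked first. Lowering misstep_prob or making the two terminal rewards different breaks the tie.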

Raymond

dmokran · January 29, 2023, 6:24am

Understood.
Thank you.
