Bellman Equation with Misstep Prob

Antonio_Furtado · January 12, 2024, 12:49pm

I was playing around with the State-action value function example lab and I noticed something interesting:

Misstep_prob = 0: The optimal policy and Q(s,a) are such that if you are in state 5, you will move right and for states 4 and lower you will move left.

Misstep_prob = 0.1: Here the behaviour is the same as above but the values of Q(s,a) are slightly lower.

Misstep_prob = 0.5: Here something interesting happens. A misstep probability of 0.5 means every step is effectively a coin toss: whichever action you choose, the rover is equally likely to move left or right, so the result of the action is random. The Jupyter notebook takes a while to calculate this scenario. Once the results are in, the Q(s,a) values for moving right and for moving left are equal in every state, and yet the optimal policy is always to move left towards state 1 for the 100 reward. Why is that? I would have expected that, if the returns from moving right and left are equal, the rover would have no way to choose which state to move to next.

Misstep_prob = 0.7: Here the optimal policy changes: starting from state 5, the rover heads for the midpoint between state 1 and state 6. It would move left until it reached state 3, at which point it would move right to state 4, and it would then keep bouncing between states 3 and 4. Why?

Misstep_prob = 1: Same behavior as above but now the rover would be bouncing between states 5 and 4.

rmwkwok · January 12, 2024, 3:14pm

Hello @Antonio_Furtado,

When the Q(s,a) values for moving left and moving right are equal, the default is to move left.
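To make the tie-break concrete, here is a minimal sketch of value iteration with a misstep probability. It is not the lab's own code: the function names `q_values` and `greedy_policy`, the reward of 40 in state 6, the discount factor of 0.5, and the use of `np.argmax` for picking the action are all assumptions made for illustration. The update it uses is Q(s,a) = R(s) + gamma * [(1 - p) * V(intended neighbour) + p * V(opposite neighbour)], where p is the misstep probability.

```python
import numpy as np

def q_values(misstep_prob, rewards=(100, 0, 0, 0, 0, 40), gamma=0.5, n_iters=200):
    """Value iteration on a 1-D rover world; returns Q with columns [left, right].

    Assumed setup (hypothetical, not taken from the lab): six states in a row,
    the two end states are terminal, gamma = 0.5, right-hand reward = 40.
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    v = rewards.copy()            # terminal states keep their reward as their value
    q = np.zeros((n, 2))
    for _ in range(n_iters):
        for s in range(1, n - 1):  # interior (non-terminal) states only
            v_left, v_right = v[s - 1], v[s + 1]
            # The intended move happens with prob (1 - p), the opposite move with prob p
            q[s, 0] = rewards[s] + gamma * ((1 - misstep_prob) * v_left + misstep_prob * v_right)
            q[s, 1] = rewards[s] + gamma * ((1 - misstep_prob) * v_right + misstep_prob * v_left)
        v[1:-1] = q[1:-1].max(axis=1)
    return q

def greedy_policy(q):
    # np.argmax returns the first index on a tie, so column 0 (left) wins
    # whenever Q(s, left) == Q(s, right).
    return np.argmax(q[1:-1], axis=1)

if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        q = q_values(p)
        print(f"misstep_prob={p}: policy for states 2-5 (0=left, 1=right) = {greedy_policy(q)}")
```

With p = 0.5 the two terms inside the bracket are weighted equally, so the left and right columns of Q come out identical and the chosen action is decided purely by the tie-break. For p above 0.5 the "opposite" move is the more likely one, so in this sketch the selected actions mirror the p = 0 case.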

Check this out for the result of 0.5 misstep probability.

Can you share a screenshot like in the above link that contains the two tables (Optimal policy and Q(s,a)) for the 0.7 case?

Cheers,
Raymond
