I wanted to see how the state-action value function handles a high misstep_prob, since most real-world applications have to cope with a lot of error from influential variables that can't be modeled.
However, I did not expect the optimal policy to always favor the lower reward. Here is a fairly extreme example:
terminal_left_reward = 10000
terminal_right_reward = 1
each_step_reward = 0
# Discount factor
gamma = 1
# Probability of going in the wrong direction
misstep_prob = 0.9999
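To look at this outside the lab's plotting utilities, here is a minimal value-iteration sketch of how I understand the setup. The 6-state 1-D gridworld with terminal states at both ends, and the reading of misstep_prob as "the agent moves opposite to the chosen direction", are assumptions on my part, not something stated in the lab code above.

# Minimal value-iteration sketch (my own reconstruction, not the lab's utils code).
# Assumed environment: 6 states in a row, states 0 and 5 are terminal,
# and with probability misstep_prob the agent moves opposite to its intended direction.
import numpy as np

terminal_left_reward = 10000
terminal_right_reward = 1
each_step_reward = 0
gamma = 1
misstep_prob = 0.9999

num_states = 6  # states 0..5; 0 and 5 are terminal (assumed layout)
rewards = np.array(
    [terminal_left_reward]
    + [each_step_reward] * (num_states - 2)
    + [terminal_right_reward],
    dtype=float,
)

# Terminal states keep their reward as their value; interior states start at 0.
V = np.zeros(num_states)
V[0], V[-1] = rewards[0], rewards[-1]

def q_values(V, s):
    # Q(s, left) and Q(s, right): the intended move is flipped with probability misstep_prob
    q_left = rewards[s] + gamma * ((1 - misstep_prob) * V[s - 1] + misstep_prob * V[s + 1])
    q_right = rewards[s] + gamma * ((1 - misstep_prob) * V[s + 1] + misstep_prob * V[s - 1])
    return q_left, q_right

for _ in range(1000):  # far more sweeps than this tiny chain needs to converge
    for s in range(1, num_states - 1):
        V[s] = max(q_values(V, s))

arrows = ["<-" if q_values(V, s)[0] >= q_values(V, s)[1] else "->"
          for s in range(1, num_states - 1)]
print("state values:", np.round(V, 2))
print("intended direction in each interior state:", arrows)

Under those assumptions this prints "->" for every interior state, i.e. the intended direction points at the terminal with the reward of 1, which matches what the lab's visualization shows me.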
Why does the policy always favor the lower reward?
I was able to replicate the same result by swapping the left and right rewards:
terminal_left_reward = 1
terminal_right_reward = 10000
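In the sketch above, this corresponds to swapping just those two variables, rebuilding `rewards`, and repeating the sweeps; the intended directions flip and again point toward the lower reward.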
This doesn’t make sense to me. Can I have some help understanding it?