State-action value function example question

Mo_Okasha · September 25, 2022, 10:03pm

Hello,

Does the Bellman equation still apply if I set the left reward to zero? I would have expected the encircled values in the generated graph to be zero in this case.

rmwkwok · September 26, 2022, 3:52pm

Hello @Mo_Okasha,

I believe you are setting misstep_prob to 0.4, which means that even if we decide to go LEFT, there is still a chance that the robot goes RIGHT and eventually collects some reward. This is why the encircled values are positive.

Cheers,
Raymond

Mo_Okasha · September 26, 2022, 6:18pm

Hello @rmwkwok ,

No, that was just me playing with misstep_prob for another reason. However, even with misstep_prob set to zero, I still see values for the left direction. Below is an updated screenshot. Thanks!

rmwkwok · September 27, 2022, 12:21am

Hello @Mo_Okasha,

Perhaps try re-running all the cells after setting misstep_prob to 0?

I ran the tests myself with misstep_prob set to 0 and to 0.4. I can reproduce your result with 0.4. With misstep_prob = 0, the LEFT value in state s is always half the RIGHT value in state s-1: after taking the LEFT action in state s and reaching s-1, the robot then turns to go RIGHT to achieve the biggest value. The value is “halved” because of the discount factor gamma, and the robot turns RIGHT because Q is defined as the return from taking action a once in state s and behaving optimally afterwards:

Q(s, a) = R(s) + gamma * max over a' of Q(s', a')

My screenshots below:

misstep_prob = 0

misstep_prob = 0.4
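
The two cases above can be checked with a small sketch. This is not the lab's code: the function name `q_values`, the 6-state layout, the right reward of 40, gamma = 0.5, and the way a misstep is modelled (the robot moves in the opposite direction with probability misstep_prob) are all assumptions made for illustration. States are indexed 0..5 here, with 0 and 5 terminal.

```python
# Hypothetical line-world sketch; names and defaults are assumptions,
# not taken from the course lab.
LEFT, RIGHT = 0, 1


def q_values(num_states=6, left_reward=0.0, right_reward=40.0,
             gamma=0.5, misstep_prob=0.0, sweeps=100):
    """Return q[s][a] after repeated Bellman updates on a 1-D line world."""
    rewards = [0.0] * num_states
    rewards[0], rewards[-1] = left_reward, right_reward
    q = [[0.0, 0.0] for _ in range(num_states)]

    for _ in range(sweeps):
        new_q = [[0.0, 0.0] for _ in range(num_states)]
        for s in range(num_states):
            for a in (LEFT, RIGHT):
                # Terminal states: the return is just the terminal reward.
                if s in (0, num_states - 1):
                    new_q[s][a] = rewards[s]
                    continue
                intended = s - 1 if a == LEFT else s + 1
                slipped = s + 1 if a == LEFT else s - 1
                # Expected best value of the next state, allowing for a misstep.
                expected_next = ((1 - misstep_prob) * max(q[intended])
                                 + misstep_prob * max(q[slipped]))
                new_q[s][a] = rewards[s] + gamma * expected_next
        q = new_q
    return q


q_no_slip = q_values(misstep_prob=0.0)
q_slip = q_values(misstep_prob=0.4)

print(q_no_slip[1][LEFT])   # 0.0: LEFT from the state next to the left terminal
print(q_no_slip[2][LEFT])   # 1.25: half of q_no_slip[1][RIGHT] == 2.5
print(q_slip[1][LEFT] > 0)  # True: a misstep can still lead towards the right reward
```

Under these assumptions, with misstep_prob = 0 the only LEFT value that is exactly zero is the one next to the left terminal state. Every other LEFT value is gamma times the RIGHT value of the state to its left, which is the pattern described above.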

Raymond

Mo_Okasha · September 27, 2022, 4:46am

But in your solution, I see that some LEFT values aren’t zero with misstep_prob = 0. Please see encircled values below. Thanks!

rmwkwok · September 27, 2022, 6:22am

Thank you. I modified my answer.
