By using the method described in the lecture, we get an estimate of the average return Q(s, a).

However, how do we use this quantity to tell us which action to take from a state?

For example, in this image from the optional lab, if we use misstep_prob = 0.7, we get the following Q(s, a) values. Following these, we never reach a terminal state and instead get stuck in a loop in the middle. How does this work in practice?

Great question @Chandni_Kausika,

The optimal policy is generated under the assumption that there is a misstep_prob of 70%, which is also why we won't actually get stuck in the middle: we may "misstep". The optimal policy and the optimal Q-values are probabilistic statements, so even if we explore the states by following the optimal policy, we can still end up with a final total reward that differs from the optimal Q-value computed for our initial state.
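To make this concrete, here is a minimal sketch of that idea. It assumes a hypothetical 1-D gridworld loosely modeled on the lab's example (six states, terminal rewards of 100 and 40 at the two ends, discount of 0.5); the specific policy and numbers below are illustrative, not the lab's actual values. It shows that with a misstep probability, the realized return of an episode varies from run to run even under a fixed policy:

```python
import random

# Hypothetical 1-D gridworld (assumption, loosely based on the lab's example):
# states 0..5, terminal rewards at state 0 (100) and state 5 (40),
# all other rewards 0, discount factor gamma = 0.5.
TERMINAL_REWARD = {0: 100.0, 5: 40.0}
GAMMA = 0.5
MISSTEP_PROB = 0.7  # with this probability the agent moves opposite to its intent

def run_episode(policy, start_state, max_steps=100):
    """Follow `policy` from start_state; a misstep flips the chosen action."""
    state, discount, total = start_state, 1.0, 0.0
    for _ in range(max_steps):
        if state in TERMINAL_REWARD:
            total += discount * TERMINAL_REWARD[state]
            return total
        action = policy[state]          # -1 = move left, +1 = move right
        if random.random() < MISSTEP_PROB:
            action = -action            # misstep: move the opposite way
        state += action
        discount *= GAMMA
    return total  # episode truncated without reaching a terminal state

random.seed(0)
policy = {1: -1, 2: -1, 3: -1, 4: +1}   # an illustrative policy, not the lab's
returns = [run_episode(policy, start_state=3) for _ in range(10_000)]
print(min(returns), max(returns))       # realized returns vary episode to episode
```

The expected value of those returns is what Q(s, a) estimates; any single episode can land well above or below it because of missteps.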

Raymond


That makes sense! Thank you.
