Random Stochastic Environment Question

By using the method described in the lecture, we get an average estimate of Q(s, a).
However, how do we use this quantity to decide which action to take from a state?

For example, in this image from the optional lab, if we use misstep_prob = 0.7, we get the following Q(s, a). If we follow the greedy actions implied by these values, we never reach a terminal state and instead get stuck in a loop in the middle. How does this work in practice?
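For concreteness, here is a minimal sketch of what I mean by "following" the Q-values. The Q-table below is made up for illustration (it is not the lab's actual output), and I'm assuming a small discrete state space with two actions per state:

```python
import numpy as np

# Hypothetical Q-table for a small gridworld with two actions per state
# (0 = left, 1 = right). The numbers are illustrative, not the lab's output.
Q = np.array([
    [50.0, 20.0],   # state 0
    [25.0, 16.0],   # state 1
    [12.5, 20.0],   # state 2
    [10.0, 40.0],   # state 3
])

# "Following" the Q-values means acting greedily: in each state, take the
# action with the largest estimated Q(s, a).
greedy_policy = np.argmax(Q, axis=1)
print(greedy_policy)  # [0 0 1 1]
```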

Great question @Chandni_Kausika,

The optimal policy is generated under the assumption that there is a misstep_prob of 70%, which is also why we won't actually get stuck in the middle: the environment will frequently "misstep" and push us out of the loop. The optimal policy and the optimal Q-values are probabilistic statements, so even when we explore the states while following the optimal policy, any single episode can end with a total reward that differs from the optimal Q-value computed for our initial state.
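Here is a minimal simulation sketch of that idea. The linear layout, the terminal states at both ends, and the "looping" policy below are assumptions for illustration, not the lab's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 6                      # assumed linear world: states 0..5
terminal = {0, n_states - 1}      # terminal states at both ends
misstep_prob = 0.7

def step(s, a):
    """Move left (a=0) or right (a=1), but with probability misstep_prob
    the environment executes the opposite action instead."""
    if rng.random() < misstep_prob:
        a = 1 - a
    return s - 1 if a == 0 else s + 1

# Suppose the greedy arrows point every interior state toward the middle,
# forming the loop described above (states 2 and 3 point at each other).
# Missteps still push the agent outward, so each rollout eventually
# reaches a terminal state with probability 1.
policy = {1: 1, 2: 1, 3: 0, 4: 0}  # hypothetical looping policy

s, steps = 2, 0
while s not in terminal:
    s = step(s, policy[s])
    steps += 1
print(f"reached terminal state {s} after {steps} steps")
```

Run it a few times with different seeds and you will see the agent always terminates, even though the arrows alone would loop forever.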

Raymond


That makes sense! Thank you.