By using the method described in the lecture, we get an estimate of the average return Q(s, a).

However, how do we use this quantity to tell us which action to take from a state?

For example, in this image from the optional lab, if we use misstep_prob = 0.7, we get the following Q(s, a) values. Following these, we never reach a terminal state and instead get stuck in a loop in the middle. How does this work in practice?

Great question @Chandni_Kausika,

The optimal policy is generated under the assumption that there is a misstep_prob of 70%, which is also why we won't actually get stuck in the middle: we may "misstep". The optimal policy and the optimal Q-values are probabilistic statements, so even if we explore the states by following the optimal policy, we can still end up with a final total reward that differs from the optimal Q-value computed for our initial state.
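To make this concrete, here is a minimal sketch of that idea. It assumes a hypothetical 1-D gridworld loosely modeled on the lab's example (six states, terminal rewards of 100 and 40 at the two ends, discount of 0.5); the specific policy and numbers below are illustrative, not the lab's actual values. It shows that with a misstep probability, the realized return of an episode varies from run to run even under a fixed policy:

```python
import random

# Hypothetical 1-D gridworld (assumption, loosely based on the lab's example):
# states 0..5, terminal rewards at state 0 (100) and state 5 (40),
# all other rewards 0, discount factor gamma = 0.5.
TERMINAL_REWARD = {0: 100.0, 5: 40.0}
GAMMA = 0.5
MISSTEP_PROB = 0.7  # with this probability the agent moves opposite to its intent

def run_episode(policy, start_state, max_steps=100):
    """Follow `policy` from start_state; a misstep flips the chosen action."""
    state, discount, total = start_state, 1.0, 0.0
    for _ in range(max_steps):
        if state in TERMINAL_REWARD:
            total += discount * TERMINAL_REWARD[state]
            return total
        action = policy[state]          # -1 = move left, +1 = move right
        if random.random() < MISSTEP_PROB:
            action = -action            # misstep: move the opposite way
        state += action
        discount *= GAMMA
    return total  # episode truncated without reaching a terminal state

random.seed(0)
policy = {1: -1, 2: -1, 3: -1, 4: +1}   # an illustrative policy, not the lab's
returns = [run_episode(policy, start_state=3) for _ in range(10_000)]
print(min(returns), max(returns))       # realized returns vary episode to episode
```

The expected value of those returns is what Q(s, a) estimates; any single episode can land well above or below it because of missteps.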

Raymond


That makes sense! Thank you.
