By using the method described in the lecture, we get an estimate of the average return Q(s, a).

However, how do we use this quantity to tell us which action to take from a state?

For example, in this image from the optional lab, if we use misstep_prob = 0.7, we get the following Q(s, a) values. Following these, we never reach a terminal state and instead get stuck in a loop in the middle. How does this work in practice?

Great question @Chandni_Kausika,

The optimal policy is generated under the assumption that there is a misstep_prob of 70%, which is also why we won't actually get stuck in the middle: we may "misstep". The optimal policy and the optimal Q-values are probabilistic statements, so even if we explore the states by following the optimal policy, we can still end up with a final total reward that differs from the optimal Q-value computed for our initial state.
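To make this concrete, here is a minimal sketch of that idea. It assumes a hypothetical 1-D gridworld loosely modeled on the lab's example (six states, terminal rewards of 100 and 40 at the two ends, discount of 0.5); the specific policy and numbers below are illustrative, not the lab's actual values. It shows that with a misstep probability, the realized return of an episode varies from run to run even under a fixed policy:

```python
import random

# Hypothetical 1-D gridworld (assumption, loosely based on the lab's example):
# states 0..5, terminal rewards at state 0 (100) and state 5 (40),
# all other rewards 0, discount factor gamma = 0.5.
TERMINAL_REWARD = {0: 100.0, 5: 40.0}
GAMMA = 0.5
MISSTEP_PROB = 0.7  # with this probability the agent moves opposite to its intent

def run_episode(policy, start_state, max_steps=100):
    """Follow `policy` from start_state; a misstep flips the chosen action."""
    state, discount, total = start_state, 1.0, 0.0
    for _ in range(max_steps):
        if state in TERMINAL_REWARD:
            total += discount * TERMINAL_REWARD[state]
            return total
        action = policy[state]          # -1 = move left, +1 = move right
        if random.random() < MISSTEP_PROB:
            action = -action            # misstep: move the opposite way
        state += action
        discount *= GAMMA
    return total  # episode truncated without reaching a terminal state

random.seed(0)
policy = {1: -1, 2: -1, 3: -1, 4: +1}   # an illustrative policy, not the lab's
returns = [run_episode(policy, start_state=3) for _ in range(10_000)]
print(min(returns), max(returns))       # realized returns vary episode to episode
```

The expected value of those returns is what Q(s, a) estimates; any single episode can land well above or below it because of missteps.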

Raymond


That makes sense! Thank you.
