In this algorithm, once a misstep probability is added, each action can lead to many possible outcomes: the agent might go to 100 directly, or it might slip to the right and then move left again on the following steps until it reaches 100. The expected return is the average over all of these possibilities, each weighted by its probability.
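To make the averaging concrete, here is a minimal sketch of how such an expected return could be computed by repeatedly applying the Bellman update. It is not the course code: the function name q_values, the six-state chain, the rewards (100 on the left end, 40 on the right end) and the discount of 0.5 are all assumptions chosen for illustration.

```python
def q_values(rewards, gamma, misstep_prob, sweeps=200):
    """Estimate Q(s, a) on a 1-D chain whose two end states are terminal.

    Hypothetical helper for illustration; q[s][0] is "left", q[s][1] is "right".
    """
    n = len(rewards)
    last = n - 1
    # Value of each state; a terminal state is worth exactly its reward.
    v = [0.0] * n
    v[0], v[last] = float(rewards[0]), float(rewards[last])
    q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(sweeps):
        for s in range(n):
            if s in (0, last):
                q[s] = [float(rewards[s]), float(rewards[s])]
                continue
            left, right = v[s - 1], v[s + 1]
            # Intended move happens with prob 1 - misstep_prob,
            # the opposite move with prob misstep_prob.
            q[s][0] = rewards[s] + gamma * ((1 - misstep_prob) * left + misstep_prob * right)
            q[s][1] = rewards[s] + gamma * ((1 - misstep_prob) * right + misstep_prob * left)
        # Acting optimally afterwards: a state's value is its best action value.
        for s in range(1, last):
            v[s] = max(q[s])
    return q


rewards = [100, 0, 0, 0, 0, 40]  # assumed layout, for illustration only
q_sure = q_values(rewards, gamma=0.5, misstep_prob=0.0)
q_slip = q_values(rewards, gamma=0.5, misstep_prob=0.1)
print(q_sure[1], q_slip[1])
```

Each sweep only looks one step ahead, but because the neighbouring values already contain their own averages, repeating the sweep folds every longer path (slip right, come back left, and so on) into the result without listing the paths one by one.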
Hope this solves your query.
Feel free to ask if you have any other doubts.
Thank you for your answer.
So in order to get these results by hand, I would have to calculate all the possible combinations of actions and states? So, practically, this is impossible to verify by hand, right?
Yes, more or less. It is not strictly impossible, but it is tedious. It is like cost minimization in linear regression, which can take a great many steps; you can check those steps while writing your algorithm.
Yes, I understand. Thank you for the prompt response.
Do you perhaps know where in the code this is implemented? I mean specifically the various possible combinations of actions, inside the utils.py file. I am trying to trace the call stack of the lines that use the misstep_prob=1 variable, but I can't quite figure out the exact point in the code where all these combinations are calculated.
In utils.py, the misstep probability is used in the generate_transition_prob() function.
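For orientation, a function of this kind typically does not enumerate combinations at all; it only records, for each state and action, the chance of landing in each next state. The sketch below is a guess at the general shape, not the actual utils.py code: the name build_transitions, the nested-list layout and the left/right encoding are all hypothetical.

```python
def build_transitions(num_states, misstep_prob):
    """Return p[s][a][s2]: chance of landing in s2 after action a in state s.

    Hypothetical illustration; action 0 means "left", action 1 means "right".
    """
    LEFT, RIGHT = 0, 1
    p = [[[0.0] * num_states for _ in (LEFT, RIGHT)] for _ in range(num_states)]
    # The two end states are treated as terminal, so their rows stay all zero.
    for s in range(1, num_states - 1):
        p[s][LEFT][s - 1] += 1 - misstep_prob   # moved left as intended
        p[s][LEFT][s + 1] += misstep_prob       # slipped to the right
        p[s][RIGHT][s + 1] += 1 - misstep_prob  # moved right as intended
        p[s][RIGHT][s - 1] += misstep_prob      # slipped to the left
    return p


p_always_slip = build_transitions(6, misstep_prob=1.0)
p_some_slip = build_transitions(6, misstep_prob=0.1)
print(p_always_slip[2][0])  # "left" from state 2 when every step slips
```

With misstep_prob=1, every intended move is replaced by the opposite one, so "left" from state 2 puts all the probability on state 3. The combinations you are looking for would then only appear implicitly, when the value update multiplies these probabilities by the values of the next states.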