I set the misstep probability to 0.1. I don't understand why Q(5, ->) = 18.52. Can someone clear this up for me?

What should it be if you set the misstep probability to zero? And why?

When I set the misstep probability to 0, Q(5, ->) equals 20.

Why should it be 20?

Sorry, I have to go very soon. It is 20 because it takes one step to reach the terminal state on the right, and the reward gets discounted once, so we get 40*0.5 = 20.

As for the case where the misstep probability equals 0.1 and you start in the 5th cell counting from the left: even if you decide to go right, there is a 10% chance that you will end up moving left, in which case it takes more than one step to reach a terminal state (and consequently more discounting).

The program will simulate many such scenarios. Roughly 90% of the time it moves to the right immediately (yielding the perfect reward of 20 points), but 10% of the time it takes longer to reach a terminal state (yielding something less than 20 points). Because it sometimes takes longer, the average reward is smaller than in the perfect case (20), and that imperfect average turns out to be 18.52.
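
If you want to check both numbers yourself, here is a rough sketch that computes the expected values directly with value iteration. This is not necessarily how the program does it. The right terminal reward of 40 and the discount of 0.5 come from this thread; the six-cell layout and the left terminal reward of 100 are my assumptions about the setup, and `q_values` is just a name I made up.

```python
def q_values(misstep_prob, rewards=(100, 0, 0, 0, 0, 40), gamma=0.5, iters=100):
    """Return {(state, action): Q} for the non-terminal cells (0-indexed).

    Assumed setup (not all of it is stated in the thread): six cells in a row,
    both end cells are terminal, left terminal pays 100, right terminal pays 40.
    """
    n = len(rewards)
    v = [float(r) for r in rewards]  # terminal cells keep their reward as value
    q = {}
    for _ in range(iters):
        for s in range(1, n - 1):
            left, right = v[s - 1], v[s + 1]
            # With probability misstep_prob the move goes the opposite way.
            q[(s, "left")] = rewards[s] + gamma * (
                (1 - misstep_prob) * left + misstep_prob * right
            )
            q[(s, "right")] = rewards[s] + gamma * (
                (1 - misstep_prob) * right + misstep_prob * left
            )
        # After the first move, act greedily: take the better action in each cell.
        for s in range(1, n - 1):
            v[s] = max(q[(s, "left")], q[(s, "right")])
    return q


# The 5th cell from the left is index 4 here.
print(round(q_values(0.0)[(4, "right")], 2))  # 20.0
print(round(q_values(0.1)[(4, "right")], 2))  # 18.52
```

Under these assumptions the 0.1 case works out to 0.5 * (0.9 * 40 + 0.1 * V(4)), where V(4) is about 10.49, which gives the 18.52 you are seeing.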

Cheers,

Raymond