Random (Stochastic) Environment Reinforcement Learning

Hi Mentor,

In the optional video lecture on random (stochastic) environments in reinforcement learning, how is the sequence of different rewards mapped to the Bellman equation? That is, how does the max over (s', a') map to the second term in the random sequence of different rewards?

Example: from state 4, below are the random sequences of different rewards.

If that is so, here is our Bellman equation. How is E[max Q(s', a')] mapped to the second term in the random sequences of rewards above?

Hello @Anbu,

As Prof. Andrew said in the lecture video, in the stochastic problem there are many possible sequences of rewards instead of a single sequence. Because the outcome is random, we are interested in maximizing the expected (average) return across all possible sequences of rewards.

Given our limited information about the agent's next step from state 4, we just take the average.

This is how the Bellman equation is modified: if you take action a in state s, the next state s' is random, so you take the expected (average) future return, denoted E[max Q(s', a')]. The total return from state s is then the immediate reward of state s plus the discount factor gamma times the average future return you expect to get.
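Written out, the modified equation described above might look as follows. This is only a sketch: the transition probability P(s' | s, a) is notation assumed here to spell out the expectation, and it does not appear elsewhere in this thread.

```latex
Q(s, a) = R(s) + \gamma \, \mathbb{E}_{s'}\!\left[ \max_{a'} Q(s', a') \right]
        = R(s) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q(s', a')
```

The max is taken inside the expectation, once per possible next state s', and the average over next states is taken afterwards.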


Sir, thanks, but I still need clarification. Please help.

max Q(s', a') is basically the best possible return from the next state s'. In the stochastic problem, assume the rover is currently in state 4. Since the next step is random, it may go to state 2 or state 4. Then max Q(s', a') means we look at the best possible return from state 2 and from state 4, and then average the results to get E[max Q(s', a')]. This is my understanding of max Q(s', a').

But max Q(s', a') does not seem to make sense, because in the stochastic problem we will get, say, 1000 different sequences of rewards and average the returns from those 1000 sequences. If so, what role does max play here? Max signals the optimal return from the next state s', but we are not seeking the optimal (highest possible) return.

An example would be helpful.
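The reading above ("best return from each possible next state, then average") can be checked with a small calculation. This is a minimal sketch with made-up numbers: the Q-values, the probabilities, and the helper name `expected_max_q` are all hypothetical and do not come from the lecture.

```python
# Hypothetical Q-values for the two possible next states (made-up numbers).
q = {
    2: {"left": 50.0, "right": 12.5},
    4: {"left": 12.5, "right": 20.0},
}

# Hypothetical probabilities of landing in each next state.
next_state_probs = {2: 0.9, 4: 0.1}


def expected_max_q(next_state_probs, q):
    """Average, over the random next states, of the best action value in each one."""
    return sum(p * max(q[s].values()) for s, p in next_state_probs.items())


# Best value in state 2 is 50, best value in state 4 is 20,
# so the average is 0.9 * 50 + 0.1 * 20.
value = expected_max_q(next_state_probs, q)
print(value)
```

The max is applied separately inside each next state, and the expectation only averages over which next state the rover happens to reach.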

Hey @Anbu,

In this statement, you are missing the next action, i.e., a', and this is in fact what the max operation deals with. From a particular next state s', there are 2 possible actions in this environment, left and right. So the max operation will choose the return corresponding to the maximising action.
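To make the two roles concrete, here is a rough sketch that solves the Bellman equation by repeated sweeps on a small rover-style line world. Everything about the setup is an assumption made for illustration, not something stated in this thread: six states in a row, terminal rewards of 100 and 40 at the two ends, gamma = 0.5, and a 0.1 chance of moving the opposite way. The names `next_state_probs` and `solve_q` are hypothetical as well.

```python
# Assumed toy setup (not given in this thread).
GAMMA = 0.5          # discount factor
MISSTEP = 0.1        # probability of moving opposite to the chosen direction
REWARD = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}
ACTIONS = ("left", "right")


def next_state_probs(s, a):
    """Distribution over next states s' for a non-terminal state s and action a."""
    intended = s - 1 if a == "left" else s + 1
    opposite = s + 1 if a == "left" else s - 1
    return {intended: 1.0 - MISSTEP, opposite: MISSTEP}


def solve_q(n_sweeps=100):
    """Repeatedly apply Q(s,a) = R(s) + gamma * E[max_a' Q(s',a')] until it settles."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in REWARD}
    for _ in range(n_sweeps):
        new_q = {}
        for s in REWARD:
            new_q[s] = {}
            for a in ACTIONS:
                if s in TERMINAL:
                    # Nothing follows a terminal state, so only its reward counts.
                    new_q[s][a] = REWARD[s]
                    continue
                # Outer sum: expectation over the random next state s'.
                # Inner max: best next action a' (left or right) in that s'.
                expected = sum(
                    p * max(q[s2].values())
                    for s2, p in next_state_probs(s, a).items()
                )
                new_q[s][a] = REWARD[s] + GAMMA * expected
        q = new_q
    return q


q = solve_q()
for s in sorted(q):
    print(s, q[s])
```

In this sketch the averaging handles what the agent cannot control (where it lands), while the max handles what it can control (which action it picks once it is there).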

Beyond that, many details of the basics of RL have been left out this week, for reasons such as time constraints, interpretability, etc. A simple example: this week only deals with optimal conditions, i.e., optimal state and action values, optimal policies, etc.; in fact, the optimality conditions are obtained only after the agent goes through non-optimal conditions, which are represented by non-optimal policies and the like. I hope this resolves your query.