In the reinforcement learning random (stochastic) environment optional video lecture, how can the sequence of different rewards be mapped to the Bellman equation? I.e., how is MaxQ(s’,a’) mapped to the second term in the random sequence of different rewards?
Example: From state 4, below are the random sequences of different rewards:
And so on. Here is our Bellman equation. How is E[MaxQ(s’,a’)] mapped to the second term in the above random sequences of different rewards?
As Prof. Andrew said in the lecture video, in the stochastic problem there would be many different possible sequences of rewards instead of a single sequence of rewards. Therefore, because the outcome is random, we are interested in maximizing the expected (average) return across all possible reward sequences.
Given our limited information about the agent’s next step from state 4, we just take the average.
This is how the Bellman equation is modified: if you take an action ‘a’ in state ‘s’, the next state ‘s’’ is random, so you take the expected (average) value of the best future returns, denoted E[MaxQ(s’, a’)]. The total return from state ‘s’ is then the immediate reward of state ‘s’ plus the discount factor gamma times this expected future return.
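For reference, here is a minimal write-up of the two formulas being discussed, in the lecture’s notation (R(s) is the immediate reward of state s, gamma the discount factor):

```latex
% Expected (average) return over all the random reward sequences:
\text{Expected return} = \mathbb{E}\left[ R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots \right]

% Bellman equation for the stochastic environment: the expectation is
% taken over the random next state s', the max over the next action a'.
Q(s, a) = R(s) + \gamma \, \mathbb{E}\left[ \max_{a'} Q(s', a') \right]
```

The second term, gamma E[MaxQ(s’, a’)], stands in for the entire random tail R2, R3, … of each reward sequence: on average, the discounted rewards from step 2 onward equal gamma times the expected best return obtainable from the next state.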
Sir, thanks, but I still need clarification. Please help.
MaxQ(s’,a’) is basically nothing but the best possible return from the next state s’. In the stochastic problem, assume the rover is currently in state 4; since the next step is random, it may go to state 3 or state 5. Then MaxQ(s’,a’) indicates that we need to look at the best possible return from state 3 and from state 5, and then average those results to get E[MaxQ(s’,a’)]. This is my understanding of MaxQ(s’,a’).
But MaxQ(s’,a’) does not make sense to me, because in a stochastic problem we will have, say, 1000 different sequences of rewards, and we are going to average the returns from those 1000 sequences. If that is so, what is the role of max here? Because max is a sign of the optimal return from the next state s’, right, sir? Yet we are not seeking the optimal (highest possible) return when we average.
In this statement, you are missing the next action, i.e., a’, and this is in fact what the max operation deals with. From a particular next state s’, there are 2 possible actions in this environment, left and right. So the max operation will choose the return corresponding to the maximising action. The average (expectation), on the other hand, runs over the random next state s’, not over the actions.
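Here is a minimal sketch of that order of operations, assuming the lecture’s Mars rover setup (discount factor 0.5, misstep probability 0.1). The Q-values and the helper name expected_max_q are illustrative, not taken from the course code:

```python
GAMMA = 0.5          # discount factor used in the lecture
MISSTEP_PROB = 0.1   # chance the rover slips and moves the opposite way

# Illustrative Q-values for the two possible next states when the rover
# is in state 4 and commands "left" (intended: state 3, slip: state 5).
Q = {
    3: {"left": 25.0, "right": 6.25},
    5: {"left": 6.25, "right": 40.0},
}

def expected_max_q(intended_state, slipped_state):
    """E[MaxQ(s', a')]: the max is taken over the next action a' WITHIN
    each possible next state; only then do we average over the random s'."""
    best_if_intended = max(Q[intended_state].values())  # max over a'
    best_if_slipped = max(Q[slipped_state].values())    # max over a'
    return ((1 - MISSTEP_PROB) * best_if_intended
            + MISSTEP_PROB * best_if_slipped)           # average over s'

# Stochastic Bellman equation for (s = 4, a = "left"):
# Q(s, a) = R(s) + gamma * E[MaxQ(s', a')]
R_s = 0  # immediate reward of the non-terminal state 4
q_4_left = R_s + GAMMA * expected_max_q(intended_state=3, slipped_state=5)
print(q_4_left)  # 0 + 0.5 * (0.9 * 25.0 + 0.1 * 40.0) = 13.25
```

So the max and the average are not in conflict: the max picks the best next action inside each possible next state, and the average accounts for not knowing which next state the rover will actually land in.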
As for the rest, there are a lot of details regarding the basics of RL that have been left out of this week, for reasons such as time constraints and interpretability. A simple example: this week only deals with optimal conditions, i.e., optimal state and action values, optimal policies, etc.; but in fact, those optimality conditions are obtained only after the agent goes through non-optimal conditions, represented by non-optimal policies and the like. I hope this resolves your query.