Confused between the Bellman equation and the MDP

An MDP (Markov decision process) means the future depends only on the current state. So here we have s and a, but not s' and a'.

In the Bellman equation, we use Q' in the Q calculation, so we assume we already have the return of the next state (even if it's random, there must be something).

\vec{Q}(s, a) = R(s) + \gamma \cdot \max_{a'} \vec{Q}(s', a')

here |\vec{Q}| means the number of actions
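To make the equation concrete, here is a minimal numeric sketch in Python. The reward, discount factor, and next-state Q-values are all made-up numbers for illustration, not values from the labs:

```python
# Minimal sketch of one Bellman backup on a toy example.
# All numbers here are hypothetical, chosen only to illustrate the formula.
import numpy as np

gamma = 0.9        # discount factor (hypothetical)
R_s = 1.0          # reward R(s) for the current state (hypothetical)

# Q-values of the next state s', one entry per action a', so |Q| = 3 actions.
Q_next = np.array([2.0, 5.0, 3.0])

# Bellman equation: Q(s, a) = R(s) + gamma * max over a' of Q(s', a')
Q_sa = R_s + gamma * np.max(Q_next)
print(Q_sa)  # 1.0 + 0.9 * 5.0 = 5.5
```

Note that the max is taken over the next actions a', which is exactly why the equation mentions s' and a' even though the decision is made in state s.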

So what do you think?

NOTE: The vector notation is mine; it is not used in the labs or the video. This is just what I think.

Hi @tbhaxor

This is quoted from the lecture,


Let me ask you a question:

Let’s say a person is fishing, and there comes a moment when the person needs to decide whether to reel or not. How do you think the person will come to that decision? Does that person know the future before making the decision? Or does that person estimate the future before making the decision?

Note the difference between “knowing the future” and “estimating the future”. The former is a gift and a superpower, whereas the latter is just experience.


Shouldn’t we always estimate the future? I think the term we use here comes from inferential statistics.

Is saying this correct?

Indeed, we cannot foresee anyway. So we estimate, but based on what? I am not actually a fishing expert, but in the fishing example, I would say: (1) whether the fishing line has moved, (2) how long it has moved, and perhaps many more that you can think of. They are all states. They are all states that we have observed. None of them comes from the future - we don’t know any future states. The thing is, given the current state s, what is the Q value if the person takes the action a (reel or not reel)?

The Q function, in the fishing example, is essentially the brain of that person, which can process the states and tell whether it is going to be more rewarding to reel or not to reel. Agree?

I see, so basically we take actions based on feedback.

(1) whether the fishing line has moved, (2) how long it has moved, and perhaps many more that you can think of

This is feedback from the environment E in which the agent \mathbf{A} currently is

We can call them feedback from the environment. We can also call it state.

And the Q function is nothing more than an encapsulation of past experience about fishing, right? We never make decisions based on the future, but based on what we have learnt in the past. If that person is very experienced in fishing, then that person’s brain (Q function) is going to give a better prediction of the Q value of reeling and the Q value of not reeling. From these two values, the person only needs to pick the action that rewards more.
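That last step - picking the action that rewards more - can be sketched in a couple of lines of Python. The Q-values here are hypothetical numbers, standing in for what the experienced angler's "brain" would predict:

```python
# Hypothetical Q-value estimates for the two actions in the fishing example,
# produced from past experience (the person's "brain", i.e. the Q function).
Q = {"reel": 4.2, "not_reel": 1.7}

# Greedy choice: pick the action with the higher estimated Q-value.
best_action = max(Q, key=Q.get)
print(best_action)  # prints "reel"
```

No knowledge of the future is needed here - only the learned estimates for each action in the current state.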


Let’s be honest: none of us can make decisions based on the future. We can at most say we believe the future is going to be a certain way, based on our past experience. We never really know the future. We need to accept that we don’t know the future. Otherwise, we cannot move on.