Well, I see the Bellman Equation gives you the return (a number) for: start in state s, take action a just once, and behave optimally after that by recursively using Q(s,a), which is the State Action Value Function.

I guess that is not right, but again, what is the formal difference between the State Action Value Function and the Bellman Equation?

What do you mean by “iteration part of the Bellman equation”? What is an iteration part? Can you highlight the iteration part on the slide that you shared earlier?

On the left side there is Q(s,a), on the right a max of Q(s’,a’).
In order to calculate the max of Q(s’,a’), Q(s,a) itself should be used.
Probably I’m missing something.

Thanks for clarifying. Let’s start with a quote from the video “Bellman equation”:

If you can compute the state action value function Q of S, A, then it gives you a way to pick a good action from every state. Just pick the action A that gives you the largest value of Q of S, A. The question is, how do you compute these values Q of S, A? In reinforcement learning, there’s a key equation called the Bellman equation that will help us to compute the state action value function.

Therefore, the Bellman equation is a formulation of how you can compute the State Action Value Function. Consider the State Action Value Function as the bigger concept, something you can compute in whatever way you want, and then somebody comes along and tells you: “hey, why don’t you compute it using the Bellman equation?” What perhaps confused you is that when the lecture first talked about the State Action Value Function, it already discussed it in a way that looks exactly like the Bellman equation. However, they are two concepts (as concluded by the above quote), even though they can share a lot of similarity in actual computation.
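To make “compute Q using the Bellman equation” concrete, here is a minimal sketch (my own toy code, not from the course) of evaluating the right-hand side of the Bellman equation, Q(s,a) = R(s) + gamma * max over a’ of Q(s’,a’), for a single (s, a) pair. The six rewards are the ones from the slide; gamma = 0.5 and the guessed Q table are my own assumptions for illustration:

```python
# Sketch: one application of the Bellman equation (my own toy notation):
#   Q(s, a) = R(s) + gamma * max_{a'} Q(s', a')
# where s' is the state reached by taking action a in state s.

def bellman_rhs(R, gamma, Q, s, a, step):
    """Right-hand side of the Bellman equation for one (s, a) pair."""
    s_next = step(s, a)                   # where does action a take us?
    return R[s] + gamma * max(Q[s_next])  # reward now + discounted best future

# Tiny line-world: states 0..5, move left (-1) or right (+1), clamped.
def step(s, a):
    return min(max(s + a, 0), 5)

R = [100, 0, 0, 0, 0, 40]          # rewards from the slide
Q = [[100, 100], [0, 0], [0, 0],   # some current guess at the Q table
     [0, 0], [0, 0], [40, 40]]

# From state 1, stepping left into terminal state 0 (assuming gamma = 0.5):
print(bellman_rhs(R, 0.5, Q, 1, -1, step))  # 0 + 0.5 * 100 = 50.0
```

Notice that the right-hand side only looks up values in the current Q table; that is exactly why the equation can be applied over and over to refine the table.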

This should have addressed the “difference between them” as asked in the title of this thread.

As for the “Markov Decision Process”, I think your question is not precise enough. If you want to connect Reinforcement Learning to a Markov Decision Process, you only need to say that our State Action Value Function takes in only the current state as the input to determine the next action, without considering any other previous states.

The Q(s,a) returns a single number (the return) for which the “story” below is true:

The first part:

start in state s.

take the (first) action a once ← A. You have to decide which action to take. No problem, just take the action that gives you the highest return?

The second part:

after you’ve taken the action once, you behave optimally afterwards ← B. You have to decide what “optimal” means (B1: for all consecutive steps?)

Question 1: How do you decide which action a to take in the beginning? Isn’t it that already for A you should take the action that gives you the highest return? Why don’t you behave optimally already for the first step, instead of only after the first step?

Question 2, about max of Q(s’, a’) over a’: I can imagine an algorithm that calculates the maximum return when all the returns are known for sure. But this is not the case at the beginning; something must set the values of all returns for all possible actions of all states. (This adds to Question 1: it says max over a’, not max over a.) OK, I know they are set randomly at the beginning. The process of updating the values, leading the action toward the highest reward, is recursive.
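To convince myself that the random starting values don’t matter, I tried a little experiment (my own toy code, with gamma = 0.5 assumed from the slide): with a discount factor below 1, repeatedly applying the Bellman update shrinks the influence of the initial values, so a randomly initialized table and an all-zeros table end up at the same fixed point.

```python
import random

# Sketch (my own toy setup): with gamma < 1 the Bellman update is a
# contraction, so repeated updates converge to the same Q table no
# matter what values you start from.

GAMMA = 0.5                         # discount factor (assumed, as on the slide)
R = [100, 0, 0, 0, 0, 40]           # rewards of the six states from the slide
N = len(R)

def step(s, a):
    return min(max(s + a, 0), N - 1)    # move left/right, clamped to the grid

def bellman_sweep(Q):
    """Apply Q(s,a) <- R(s) + gamma * max_a' Q(s',a') to every (s, a)."""
    return [[float(R[s]) if s in (0, N - 1)            # terminal states
             else R[s] + GAMMA * max(Q[step(s, a)])
             for a in (-1, +1)]
            for s in range(N)]

random.seed(0)
Q_random = [[random.uniform(-100, 100) for _ in range(2)] for _ in range(N)]
Q_zeros = [[0.0, 0.0] for _ in range(N)]

for _ in range(100):                # iterate the update on both tables
    Q_random = bellman_sweep(Q_random)
    Q_zeros = bellman_sweep(Q_zeros)

# Both initializations end up at (numerically) the same fixed point.
diff = max(abs(x - y) for r1, r2 in zip(Q_random, Q_zeros)
           for x, y in zip(r1, r2))
print(diff)  # vanishingly small
```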

I will stop thinking out loud now. I see I’ll have to think it over quietly and take a deeper look at the code. Maybe you have some advice.

I completed all three courses of this Machine Learning Specialization and earned the certificate, so my subscription ends at the end of February. Do you know whether I will still have access to this conversation and be able to discuss here after my subscription is canceled?

Because we want to determine Q(s, a), and the a here is a variable. Q(s, a) means “hey, if I am at state s and I take action a, what is the Q?”, or “hey, if I am at this state and I turn left, what would my Q be? What if I turn right, is that Q larger?” So we can ask about Q(s, a = left) and Q(s, a = right). In Q(s, a), s and a are variables, right?

Why not? In the slide, we know the reward at each state: they are 100, 0, 0, 0, 0, and 40 respectively. Why would they not be known to us?
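Since the rewards are known, the whole Q table can be worked out by just iterating the Bellman update until it stops changing. Here is a sketch (my own code, assuming gamma = 0.5 as on the slide, deterministic left/right moves, and terminal states at both ends):

```python
# Sketch: iterate the Bellman update on the six-state example
# (rewards 100, 0, 0, 0, 0, 40; gamma = 0.5 is assumed from the slide).

GAMMA = 0.5
R = [100, 0, 0, 0, 0, 40]
N = len(R)

def step(s, a):
    return min(max(s + a, 0), N - 1)   # move left (-1) or right (+1), clamped

Q = [[0.0, 0.0] for _ in range(N)]     # start from zeros
for _ in range(50):
    # One full sweep of Q(s,a) <- R(s) + gamma * max_a' Q(s',a').
    # Terminal states just keep their reward.
    Q = [[float(R[s]) if s in (0, N - 1)
          else R[s] + GAMMA * max(Q[step(s, a)])
          for a in (-1, +1)]
         for s in range(N)]

for s in range(N):
    print(s, Q[s])   # Q for [left, right] in each state
```

For example, the middle states come out as Q(1) = [50, 12.5] and Q(4) = [6.25, 20], so from state 4 the best action is to step right toward the reward of 40.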

That is not a bad idea.

Yes! We can discuss here after your subscription ends.