State-action Value Function - Video


I have a couple of questions here:

  1. I still don’t understand why we can’t just “behave optimally” starting from any state, and why we instead need to “take action a” and then behave optimally. Dr. Ng said it would become clear later on, but I have finished all the State-action Value Function videos and still don’t understand why.

  2. “take action a (once)” - which action do we take here? Random or following some logic?

  3. Also, what is the intuition behind the discount factor? In the Mars Rover example, why wouldn’t we want it to go for the highest reward from any state, instead of using a discount factor that basically penalizes each action?

thank you!

Hello @Svetlana_Verthein,

  1. Notice that there is an a in the parameter list of Q(s, a).

  2. That means a is a parameter of our choice.

  3. It further means that Q(s, a) is made to tell us "what is the Q value if I am at state s and take action a?"

  4. Therefore, we are free to choose any a, and Q(s, a) should return an estimate for it.

The above flow lays out the purpose for Q(s, a). It is a very useful function. Why? Because we can use the information provided by Q(s, a) to make the best decision.

For example, I am in state 100, and there are two actions I can take: action 0 and action 1. Then here are the steps:

  1. I ask for Q(s=100, a=0). What does it tell me? If I take action 0, what return can I expect?

  2. I ask for Q(s=100, a=1). What does it tell me? If I take action 1, what return can I expect?

  3. If Q(s=100, a=0) = 100000 and Q(s=100, a=1) = 0, it means that a return of 100000 is waiting for me if I take action 0, so I choose that action.

Now I hope we see the purpose of Q(s, a): it lets us find the best action.

In my example above, is the action random, or does it follow some logic? Neither. I simply asked for the Q value of each action, then picked the best one.
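To make the steps above concrete, here is a minimal Python sketch. The table `q_values` and the helper `best_action` are hypothetical names made up for illustration; the numbers are the ones from the state-100 example.

```python
# Hypothetical Q-value table for the example: state 100 with actions 0 and 1.
q_values = {(100, 0): 100000.0, (100, 1): 0.0}


def best_action(q, state, actions):
    """Ask for Q(state, a) for every available action and keep the largest."""
    return max(actions, key=lambda a: q[(state, a)])


chosen = best_action(q_values, 100, [0, 1])
print(chosen)  # 0
```

The only "logic" here is a comparison of the Q values, one per action.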

The Mars Rover example is too simple to show what problem we are actually solving. Let’s consider another example:

Suppose person A is dropped off on an island they have never visited. The dropoff point is surrounded by trees. The person is asked to find the only lake on the island, but cannot see any sign of water from the dropoff point.

Now person A is ready to take their first step.

Let me ask you this, @Svetlana_Verthein: given that this is all the information that both the person and we have about the island, can we pick a direct route to the lake for that person? We probably can’t, right? Let’s be honest, we simply don’t know. However, what would you recommend that person consider before taking the next step?


Thank you, Raymond.

This is where I get confused: are the values of each state known in advance? In the Mars Rover example it seemed like they were.

In that case I understand how the Q function works, as you explained above.
That makes perfect sense, and I understand how Q can give the action value for each action at each state.

In your example of the man on an island it seems to me that only the terminal state’s value (location of the lake) is known in advance, but not of the intermediate states. How does the Q function decide which action to take? Doesn’t it need to know the values of each state to calculate each action value for that state? In the Mars Rover example it seemed like it does, but in your island example it seems like it doesn’t.

I think my questions demonstrate that I am missing something fundamental here, so I am going to re-watch the videos and read up on this on the Internet. I need to understand how this all works, and I think I need to spend more time on it rather than asking random questions :smile:
thank you for your detailed answers!

Hello @Svetlana_Verthein

No, it is not always known in advance. The Mars Rover example is there to demonstrate concepts like the Bellman equation. It gives us a simple problem in which every number is known, so that we can plug those numbers into the equations we have learnt and practice the concepts. However, the concepts from the lecture are not limited to that one example. Even though we know everything from the beginning in the Mars Rover example, in a general reinforcement learning problem, including the one presented in the assignment, we don’t know everything in advance.

The lake-searching problem is such an example. We know there is a lake, but we don’t know where it is. The only island-related information available is whatever the person sees from the dropoff point.

Another key point to remember is that the Q function is a function that helps us make decisions by telling us the Q value of taking an action from a state. However, the function doesn’t necessarily return the true Q value. In the Mars Rover example, it returns the true value because we know everything in advance. In the lake-searching example, we don’t know everything, so it can only return an estimate.

We can further discuss a strategy and relate the discussion to the lecture content. For example, if the person not only lacks island-related information but is also inexperienced in lake searching, then an “exploration” strategy should be adopted so that we explore and learn about this unknown island. This idea is covered in the lecture video: Algorithm refinement: ϵ-greedy policy. Whatever we gain from that exploration process will be used to train our Q function. Note that this time the Q function is going to improve over time as we explore, which also means that it won’t be perfect from the beginning as in the Mars Rover example. Again, this is because we don’t know everything in advance.
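As a rough illustration of the ϵ-greedy idea, here is a minimal Python sketch. The function name `epsilon_greedy`, the list `q_row`, and the sample numbers are assumptions made up for this example; they are not taken from the lecture or the assignment code.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index given the Q estimates for the current state.

    q_row   -- list of Q estimates, one per action (hypothetical layout)
    epsilon -- probability of exploring with a uniformly random action
    """
    if rng.random() < epsilon:
        # Explore: try any action, even one that currently looks bad.
        return rng.randrange(len(q_row))
    # Exploit: take the action with the largest current Q estimate.
    return max(range(len(q_row)), key=lambda a: q_row[a])


q_row = [1.0, 5.0, 2.0]              # made-up Q estimates for three actions
print(epsilon_greedy(q_row, 0.0))    # 1, because epsilon = 0 never explores
print(epsilon_greedy(q_row, 0.1))    # usually 1, occasionally a random action
```

A small ϵ means we mostly trust the current Q estimates, while still occasionally trying something else so that the estimates can improve.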

I agree with your plan to re-watch the videos. I suggest you keep a note of your unanswered doubts, re-watch all the videos, and try to complete the assignment. Keep in mind that the Mars Rover example is a very small, manageable example for demonstrating ideas that we should be able to carry over to a more realistic problem in which we know little or nothing about the environment.

It is also normal to re-watch the videos a few times. I do this all the time too. The videos are prepared from one particular perspective, and there is no guarantee that it matches the way we look at things.


There is a great deal of magic in learning the Q matrix values, which is not really covered here in any detail.

Thank you both. I was really confused by the formula for Q, which is recursive and uses the future state, not the past one as we were taught in CS. I kept thinking: how on earth can we calculate Q(s, a) for the first time when we don’t know the next state/action? I get it now - we just initialize all Q values randomly.
Whew, that was a big breakthrough :rofl:
I read up the easier blogs/articles on this topic online, and I think I have a much better grip now.
Thanks again!

@Svetlana_Verthein, now I can see better why you felt confused. To me, any model is an encapsulation of the knowledge we have about a world (I say a world because it can be our Mars Rover world, our assignment’s lunar lander world, and so on). The question is how we encapsulate that knowledge into the model. The answer is that we train it, and there are many ways to do so. A very simple world like the Mars Rover is basically transparent to us. We know every single detail about it. It is so simple that we didn’t even notice the process of training a model. However, we did train it. Our model is still called “Q(s, a)”, and the moment we know the Q value for every combination of (s, a), we have trained it. The only difference is that the training process is not sophisticated gradient descent, or anything that takes us months to learn. It’s just the Bellman equation and basic maths, and perhaps a calculator will do.
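Here is a minimal Python sketch of that "Bellman equation and basic maths" process for a Mars-Rover-like world. Everything numeric in it is an assumption made for illustration: six states in a row, a reward of 100 at the left end and 40 at the right end, 0 elsewhere, both ends terminal, deterministic left/right moves, and a discount factor of 0.5. It repeatedly applies Q(s, a) = R(s) + γ · max over a' of Q(s', a') until the numbers stop changing.

```python
GAMMA = 0.5                                   # assumed discount factor
REWARDS = [100.0, 0.0, 0.0, 0.0, 0.0, 40.0]   # assumed reward for each state
TERMINAL = {0, 5}                             # assumed terminal states
LEFT, RIGHT = 0, 1


def solve_q(sweeps=50):
    """Fill in Q(s, a) by sweeping the Bellman equation over all states."""
    n = len(REWARDS)
    q = [[0.0, 0.0] for _ in range(n)]        # start from zeros
    for _ in range(sweeps):
        for s in range(n):
            for a in (LEFT, RIGHT):
                if s in TERMINAL:
                    # Nothing happens after a terminal state.
                    q[s][a] = REWARDS[s]
                else:
                    s_next = s - 1 if a == LEFT else s + 1
                    q[s][a] = REWARDS[s] + GAMMA * max(q[s_next])
    return q


q = solve_q()
print(q[1])  # [50.0, 12.5]
print(q[2])  # [25.0, 6.25]
print(q[3])  # [12.5, 10.0]
print(q[4])  # [6.25, 20.0]
```

Because every reward and transition is known here, no exploration is needed; a few sweeps are enough for the table to settle.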

For the more complicated lunar lander world (the assignment), it is less trivial. There is physics in it. The lander can crash. It can lose control. In this kind of example we are no longer on top of everything. We initialize Q(s, a) randomly, and we inject information into the model bit by bit throughout the training process, in the hope that the final model is going to work well. Yes, it is just hope. There is no guarantee.
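To show what "initialize randomly, then inject information bit by bit" can look like, here is a minimal tabular Q-learning sketch in Python. It is a simplified stand-in, not the assignment's method: the assignment uses a neural network, whereas this uses a small table. The six-state world, its rewards, the discount factor, the learning rate `ALPHA`, and the purely random exploration are all assumptions made for illustration.

```python
import random

GAMMA = 0.5                                   # assumed discount factor
ALPHA = 0.5                                   # assumed learning rate
REWARDS = [100.0, 0.0, 0.0, 0.0, 0.0, 40.0]   # assumed reward for each state
TERMINAL = {0, 5}                             # assumed terminal states
LEFT, RIGHT = 0, 1


def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    n = len(REWARDS)
    # Random initial guesses for Q; terminal states are pinned to their reward.
    q = [[rng.random(), rng.random()] for _ in range(n)]
    for s in TERMINAL:
        q[s] = [REWARDS[s], REWARDS[s]]
    for _ in range(episodes):
        s = rng.choice([1, 2, 3, 4])          # start in a random inner state
        while s not in TERMINAL:
            a = rng.choice([LEFT, RIGHT])     # explore with random actions
            s_next = s - 1 if a == LEFT else s + 1
            # Nudge the current guess toward the Bellman target.
            target = REWARDS[s] + GAMMA * max(q[s_next])
            q[s][a] += ALPHA * (target - q[s][a])
            s = s_next
    return q


q = train()
# Greedy action in each inner state according to the learned table.
policy = ["left" if row[LEFT] >= row[RIGHT] else "right" for row in q[1:5]]
print(policy)
```

Each update uses only one observed step, so early estimates are poor and improve as more experience is collected. In a world this small the table settles reliably; in something like the lunar lander there is no such guarantee.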

When I learnt Reinforcement Learning, I read a lot, though it is a shame that I didn’t keep those links. Keep it up! That’s the right way!


Thank you, Raymond and Tom, for the words of encouragement and the good advice you’ve provided.
Well, I got my certificate for this course, and I finished MLS. I’ll take a week to go over everything I’ve learned in this specialization, re-do all the assignments to solidify my knowledge, and move onto the DLS course.

In parting I just wanted to say (as I did in my ratings) that this course was just right for me, not too hard, not too easy, not too long nor too short. Dr. Ng’s presentation was brilliant, as always, and the community - you, guys! - provided excellent support.

Just a couple of overall observations I wanted to make:

  1. It’d be very useful to have a slide on how to approach building a new neural network - how many layers to start with, how many units, how to adjust it and based on what. I noticed this question was asked often.

  2. I was confused about the Q (s, a) values’ initialization, as we discussed here, and I was similarly confused about the initial values for w and b values earlier in the NN course. Couldn’t figure out where to start, where to get the initial values - and in both courses it turned out to be initializing to random values. Although I knew w and b were initialized randomly in linear and logistic regressions, somehow that thought didn’t occur to me when it came to NN. It was a big source of confusion and frustration for me. I think this also would be useful if explicitly mentioned throughout the relevant lectures.

  3. Almost forgot - I think back prop, and how it works for NN, should be included in the NN lectures. The optional lecture on back prop was a basic explanation of the chain rule in calculus, but did not specifically cover why it matters in NN and how it’s applied, or how much it matters computationally, meaning one can’t just design super-deep NNs etc. I found good articles online, and without them I must say I’d be very confused about what’s going on.

Thanks again, and all my best!

Hi @Svetlana_Verthein,

Thank you for your suggestions. It seems to me that your learning passion has exceeded the scope of this specialization. I think it is one of the reasons why the material of this course was not enough for you.

It is true that your point number one is a frequently asked question, and I have thought about whether I can propose anything to the course team. However, I am always reluctant to do so. To begin with, there is simply no rule of thumb for this matter. It is very problem dependent, and data size dependent as well. What’s more, the dependence is not exact - I cannot write down equations that return the details of the architecture given (e.g.) the data size, not to mention that it is hard to “quantify a problem”.

On the other hand, even though it is easy to blindly suggest some architectures, doing so might be misleading. I would personally worry that someone might take those suggestions for granted and miss the more important iterative cycle of development (MLS Course 2 Week 3, and DLS Course 2 Week 1 and Course 3 Week 1), a process that requires us to continuously monitor performance and brainstorm improvements. Rather than a rule that everybody can follow to build a good NN, it is an investigative process that requires knowledge, skills, experience, and sometimes creativity. Perhaps this is one reason why this job can’t be replaced by some automated computer program.

I would rather persuade learners that there is no rule of thumb for building a NN, and suggest that they experiment (for experience), read about the techniques other neural networks used to make things work (for skills), read how people explain those architectures and techniques (for knowledge), and so on. Believe it or not, I think 80% of the whole 5-course Deep Learning Specialization is about these. IMHO, that is also why it is a challenge to include this in our MLS. Still, learners have every right to ask that question. It is a very natural question; there is just no simple answer to it. It’s like when someone asks “how can I be successful?”.

@Svetlana_Verthein, I hope that after the DLS you will be better able to tell how to start designing a neural network architecture for the problem you need to solve :wink: