Max Q(s',a') for continuous state spaces


I am having a hard time understanding the “max(Q(s’,a’))” for the continuous state spaces problems. As I understand, here we cannot just take an action and then “behave optimally”, since we do not really know which way to go due to the massive number of possibilities (as stated in the lab text), as opposed to the shown discrete problems where we easily could calculate what is the optimal behavior (go far left or far right).
Or have I misunderstood? Are we knowing what is the optimal behavior all the way down to landing?

In the lab we are taking one step at a time, wouldn’t that just make
max(Q(s’,a’))=max(R(s’))? where s’ is one of the possible states next to where we are?

I would really appreciate getting some help straighten out my thoughts. Thanks.

Hey @G11,
Welcome back to the community.

I would love to answer this question, since I have recently explored this myself, but as of now, it’s pretty late from where I belong. And the answer to this is a pretty intriguing one, so I would like to take some time to compose it. I will curate an answer for you by tomorrow. Till then, keep learning :nerd_face:


Hi @G11,

In the mean time, here is my attempt.

To begin with, Q is a neural network we know it is not perfect. We do not guarantee \max_{a'}Q to get us the true a'. However, we hope it is going to get better and better over training. This hope is driven by the inclusion of R(s).

Therefore, it can be considered as an assumption that \max_{a'}Q will pick the a' which behaves optimally, and the assumption is assumed to be stronger over training.

We don’t know. We make informed guesses with the help of the Q-network. We hope the guesses will be better and better over the course of training.

No. Did you actually prove that Q(s', a') = R(s')? You know your claim was not shown in any of the lecture materials, so what did you consider in doubting this claim?

@G11, I agree with you that in a continuous state space problem (and sometimes even a discrete one), we do not know the true Q(s, a), not to mention \max_{a'}Q(s, a). Since we do not know it, we do not ask ourselves for true Q values. All we are doing is just to make best and informed guesses of Q.

Here we transit from “knowing” to “making informed guesses”.

When we learn as a beginner, we learn with small examples, and it is easy for us to know everything. As we dig deeper, we meet with situations where we do not know everything, then we start to make informed guesses.

The Q-network is initialized randomly, but then over the course of training, we continuously “inject” information (i.e. rewards) in the hope that the Q-network is getting closer and closer to the truth.

Such Q-network trained with information allows us to make informed guesses.


Hey @G11,

Reference 1: “Reinforcement Learning: An Introduction”, by Sutton and Barto
Reference 2: “Reinforcement Learning Algorithms with Python”, by Andrea Lonza

Let’s try to see how we can deal with continuous action-spaces, and just to make this post complete, let me also tag in continuous state-spaces.

1. Uncoupling some concepts from lab

Starting off, let me uncouple a few things from the lab:

1.1. Q-Learning

The algorithm which has been used in the lab is DQN, and it consists of 2 major components, Deep Neural Networks and Q-Learning (which is an off-policy Temporal Difference based learning method). However, when we only use these 2 components, the researchers have observed high instabilities, as mentioned in the lab, so there are 2 other additions to it: Target Q-Network and Experience Replay, which are also described in the lab.

Why I am repeating this? Just to highlight the fact that Q-Learning is an individual algorithm (not discussed in the course) and forms the foundation of DQN, while DQN is only an extension to it. So, in case you want to learn more about DQN, I encourage you to first learn everything you can about Q-Learning, for which you can refer Ref 1 (Chapter 6) and Ref 2 (Chapter 4).

1.2. Function Approximation

This phrase has not been mentioned in the course or in the lab, for simplicity purposes, but DQN is a function-approximation based method. Now, what is meant by that? Well, the answer is very long, since it is an entire research area in itself, but let me present the short version here for you.

In the lab, the state-space is continuous, as you would have seen float values in the state-vector. But unlike this, when we have small and finite state-spaces, we can use tabular methods, like Monte Carlo, SARSA, Q-Learning, Expected-SARSA, etc. However, when the state-space becomes large (for instance, millions of discrete states, or continuous), in that case, we need methods that can generalise over many states. This is because, all the tabular methods rely on the assumption that each state in the state space will be visited a large number of times during training, which is not possible for large state-spaces.

The good thing here is that we don’t need to learn new RL methods for that. There are some methods and algorithms, also referred to as generalisation methods, which come under the umbrella of Function Approximation, which can take existing tabular methods, and generalise them over large state-spaces. So, DQN, is nothing but Q-Learning + Non-Linear Function Approximation. You can read more about Function Approximation in Ref 1 (Part II).

2. Methods dealing with continuous spaces

Now, there has been an extensive study on how to deal with such scenarios. Let me present to you the two most intuitive and perhaps the most popular approaches as well.

2.1. Action-Value Methods + Feature Construction

Let’s first try to understand what is meant by feature construction. In this lab, we have directly fed the state vector into the neural networks without performing any operations on it. However, there is a whole umbrella of methods which comes under feature construction which we can deploy to transform our state and action spaces. Let me present here one of the simplest ones, which is Tile-Coding (a form of Coarse Coding), which is a fancier form of Discretization. Consider an example in which the state space is continuous and ranges from [-2, 2]. In this case, one of the simplest discretizations is as follows:

  • [-2, -1) = Represented by 0
  • [-1, 0) = Represented by 1
  • [0, +1) = Represented by 2
  • [+1, +2] = Represented by 3

In a similar fashion, we can discretize our action space as well. Now, Tile-Coding takes this to a whole another level by improving many attributes of this, which you can read more about in Ref 1 (Chapter 9).

Now, let’s try to understand, what are action-value methods. Once again, a whole another umbrella of methods, which includes DQN as well. In action-value methods, we learn the values of actions and then select the actions based on their estimated action values. If we take DQN as a reference, the q_network and the target_q_network produce action values for the input states, which are represented by q_values, and the below line of code:

action = utils.get_action(q_values, epsilon)

selects the action based on these values.

Now, I assume we have a basic understanding of both these terminologies. And now, we can easily discuss the method, which is nothing but “Using some form of feature construction + some action-value method”. A simple example, for continuous action spaces, you can discretize the action-space, and use DQN itself.

2.2. Policy Gradient Methods

I believe this category of the methods is more powerful in dealing with continuous spaces, particularly, when we have continuous action-spaces. In this category of methods, we instead learn a
parameterized policy that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but is not required for action selection.

Now, this category of methods offers some amazing advantages for continuous action-spaces. Instead of computing learned probabilities for each of the many actions, we instead learn statistics of the probability distribution. For example, the action set might be the real numbers, with actions chosen from a normal (Gaussian) distribution. You can read more about Policy Gradient methods in Ref 1 (Chapter 13).

3. Optimal Behaviour

The short version:

  • During training, we don’t know the optimal behaviour till the terminal states.
  • Once the agent is trained, based on how accurate our trained agent is, we can estimate the optimal behaviour till the terminal states, with that accuracy.

However, a more interesting question here would be, “During training, how far into the future we can look, and compute the estimates?”. Interestingly :joy:, here’s another whole category of methods, known as n-step methods, which deals with this. Monte Carlo methods, are often known as infinite-step methods, and they come with minimal bias and maximal variance regarding their estimates. Methods like SARSA, Q-Learning, Expected SARSA, etc are 1-step methods, and they come with maximal bias and minimal variance regarding their estimates. It is the n-step methods, which comes with the best of both the worlds. You can combine n-step methods with DQN, which would make it a variation of DQN, popularly known as N-step DQN. You can read more about DQN and it’s variations in Ref 2 (Chapter 5), and you can read more about n-step methods in Ref 1 (Chapter 7).

  1. Conclusion
    In this thread, I have extensively referred to the 2 sources which I mentioned towards the beginning of this post, since I have learnt from these 2 sources only.

For keeping this thread as concise as possible, I have skipped a lot of details as well, since it’s impossible for me to summarize them all, due to 2 reasons; first, this post will become infinitely long; and second, even I am a beginner in Reinforcement Learning as of now.

If you find some of the concepts mentioned in this thread confusing, please feel free to skip over them. The only thing you need to keep in mind is that Reinforcement Learning has a very well-built foundational structure, and on each of it’s aspects, there is considerable research going on right now. You are more than welcome to explore the intricacies of RL.


Thanks a lot for your answer @rmwkwok. A follow-up question then. Again for the continuous state spaces, is Q(s’,a’) a vector? I mean, since we using “max”, it must be. How long is that vector?
For the discrete problem the length of that vector is = 2 (return going far left, return going far right). What is it for the lunar lander?

Hi @G11,

You could check the assignment youself. That lunar lander had 4 actions, and its Q-network’s output had 4 values.

If you have time, and if you are interested, consider how you would model it if it is not 4 discrete actions, but some continuous thrust value. If you cannot find any ideas from the lectures, that means an idea might need to be found on the internet.


1 Like