Hey @G11,

**Reference 1**: “Reinforcement Learning: An Introduction”, by Sutton and Barto

**Reference 2**: “Reinforcement Learning Algorithms with Python”, by Andrea Lonza

Let’s try to see how we can deal with **continuous action-spaces**, and just to make this post complete, let me also tag in **continuous state-spaces**.

# 1. Uncoupling some concepts from lab

Starting off, let me uncouple a few things from the lab:

## 1.1. Q-Learning

The algorithm used in the lab is **DQN**, which consists of 2 major components: **Deep Neural Networks** and **Q-Learning** (*an off-policy Temporal Difference learning method*). However, researchers have observed that training is highly unstable when only these 2 components are used, as mentioned in the lab, so there are 2 other additions to it: **Target Q-Network** and **Experience Replay**, which are also described in the lab.

Why am I repeating this? Just to highlight the fact that **Q-Learning** is an algorithm in its own right (*not discussed in the course*) and forms the foundation of DQN, while DQN is only an extension of it. So, in case you want to learn more about DQN, I encourage you to first learn everything you can about **Q-Learning**, for which you can refer to Ref 1 (*Chapter 6*) and Ref 2 (*Chapter 4*).
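To make the idea concrete, here is a minimal sketch of a single tabular Q-Learning update. The table size, the function name `q_learning_update`, and the values of `alpha` and `gamma` are my own assumptions for illustration; they are not taken from the lab.

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-Learning step to the table Q (states x actions)."""
    # Off-policy TD target: bootstrap from the greedy action in the next state.
    # For a terminal transition there is nothing to bootstrap from.
    target = reward if done else reward + gamma * np.max(Q[next_state])
    # Move the current estimate a fraction alpha toward the target.
    Q[state, action] += alpha * (target - Q[state, action])
    return Q

# Hypothetical toy problem with 3 states and 2 actions.
Q = np.zeros((3, 2))
Q = q_learning_update(Q, state=0, action=1, reward=1.0, next_state=2, done=False)
```

DQN keeps the same target, but replaces the table `Q` with a neural network.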

## 1.2. Function Approximation

This phrase has not been mentioned in the course or in the lab, for simplicity, but DQN is a function-approximation based method. Now, what is meant by that? Well, the answer is very long, since it is an entire research area in itself, but let me present the short version here for you.

In the lab, the state-space is continuous, as you would have seen float values in the state-vector. In contrast, when we have small and finite state-spaces, we can use **tabular methods**, like Monte Carlo, SARSA, Q-Learning, Expected-SARSA, etc. However, when the state-space becomes large (*for instance, millions of discrete states, or continuous*), we need methods that can generalise over many states. This is because all the tabular methods rely on the assumption that each state in the state space will be visited a large number of times during training, which is not possible for large state-spaces.

The good thing here is that we don’t need to learn new RL methods for that. There are some methods and algorithms, also referred to as **generalisation methods**, which come under the umbrella of **Function Approximation**, and which can take existing tabular methods and generalise them over large state-spaces. So, DQN is nothing but **Q-Learning + Non-Linear Function Approximation**. You can read more about Function Approximation in Ref 1 (*Part II*).
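As a rough illustration, here is Q-Learning with the simplest kind of function approximation: a linear one, with one weight vector per action. DQN uses a neural network (non-linear) instead, so treat this only as a sketch of the idea. The names `q_value` and `semi_gradient_q_update`, the feature vector, and the step size are assumptions of mine.

```python
import numpy as np

def q_value(weights, features, action):
    """Linear estimate of Q(s, a): dot product of the action's weights and the state features."""
    return float(weights[action] @ features)

def semi_gradient_q_update(weights, features, action, reward, next_features, done,
                           alpha=0.01, gamma=0.99):
    """One semi-gradient Q-Learning step; updates weights in place and returns the TD error."""
    n_actions = weights.shape[0]
    next_q = 0.0 if done else max(q_value(weights, next_features, a) for a in range(n_actions))
    td_error = reward + gamma * next_q - q_value(weights, features, action)
    # For a linear approximator, the gradient of Q(s, a) w.r.t. weights[action] is the feature vector.
    weights[action] += alpha * td_error * features
    return td_error

# Hypothetical continuous state described by 3 float features, and 2 actions.
weights = np.zeros((2, 3))
state = np.array([1.0, 0.5, -0.5])
next_state = np.array([0.9, 0.4, -0.6])
td_error = semi_gradient_q_update(weights, state, action=0, reward=1.0,
                                  next_features=next_state, done=False)
```

Because the weights are shared across all states, one update changes the estimates for every state with similar features, which is exactly the generalisation that tabular methods lack.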

# 2. Methods dealing with continuous spaces

Now, there has been extensive study on how to deal with such scenarios. Let me present to you the two most intuitive, and perhaps the most popular, approaches.

## 2.1. Action-Value Methods + Feature Construction

Let’s first try to understand what is meant by **feature construction**. In this lab, we have directly fed the state vector into the neural networks without performing any operations on it. However, there is a whole umbrella of methods which comes under **feature construction** which we can deploy to transform our state and action spaces. Let me present here one of the simplest ones, which is **Tile-Coding** (*a form of Coarse Coding*), which is a fancier form of **Discretization**. Consider an example in which the state space is continuous and ranges from `[-2, 2]`. In this case, one of the simplest discretizations is as follows:

- [-2, -1) = Represented by 0
- [-1, 0) = Represented by 1
- [0, +1) = Represented by 2
- [+1, +2] = Represented by 3

In a similar fashion, we can discretize our action space as well. Now, Tile-Coding takes this to a whole other level by improving many attributes of this, which you can read more about in Ref 1 (*Chapter 9*).
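Here is a small sketch of both ideas for the 1-D range `[-2, 2]`. The first function reproduces the 4 bins listed above using `np.digitize`. The second is a deliberately simplified tile coder with a few offset tilings; it is my own toy version (the names, the number of tilings, and the offsets are assumptions), not the exact scheme from Ref 1.

```python
import numpy as np

def discretize(x):
    """Map values in [-2, 2] to the bins 0..3 from the list above."""
    # Interior bin edges; np.digitize returns i such that edges[i-1] <= x < edges[i].
    edges = np.array([-1.0, 0.0, 1.0])
    return np.digitize(x, edges)

def tile_code(x, low=-2.0, high=2.0, n_tiles=4, n_tilings=3):
    """Return the index of the active tile in each of several offset tilings."""
    width = (high - low) / n_tiles
    active = []
    for t in range(n_tilings):
        # Each tiling is shifted by a fraction of one tile width.
        offset = t * width / n_tilings
        idx = int(np.floor((x - low + offset) / width))
        # Shifted tilings need one extra tile to cover the range, so indices run 0..n_tiles.
        active.append(min(max(idx, 0), n_tiles))
    return active

bins = discretize(np.array([-2.0, -1.0, -0.5, 0.0, 1.0, 2.0]))
tiles_a = tile_code(-0.5)
tiles_b = tile_code(-0.3)
```

Plain discretization puts `-0.5` and `-0.3` in the same bin, so they are indistinguishable. With tile coding, they share some active tiles but not all of them, which gives generalisation between nearby states while still telling them apart.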

Now, let’s try to understand what **action-value methods** are. Once again, this is a whole other umbrella of methods, which includes DQN as well. In action-value methods, we learn the values of actions and then select the actions based on their estimated action values. If we take DQN as a reference, the `q_network` and the `target_q_network` produce action values for the input states, which are represented by `q_values`, and the below line of code:

`action = utils.get_action(q_values, epsilon)`

selects the `action` based on these values.
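I am not reproducing the lab’s `utils.get_action` here. The sketch below is a generic epsilon-greedy selector, which is the usual way of picking an action from action values; the name `epsilon_greedy` and the explicit `rng` argument are my own assumptions.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    q_values = np.asarray(q_values).ravel()
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return int(rng.integers(len(q_values)))
    # Exploit: choose the action with the highest estimated value.
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = [0.1, 0.7, 0.3, -0.2]  # made-up action values for 4 discrete actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
explored_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the selector is purely greedy, and with `epsilon=1.0` it is purely random.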

Now, I assume we have a basic understanding of both these terms, so we can easily discuss the method, which is nothing but “Using some form of feature construction + some action-value method”. As a simple example, for continuous action spaces, you can discretize the action-space and use DQN itself.
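A minimal sketch of that last idea: we fix a grid of candidate actions, let the network output one value per grid point, and map the chosen index back to a continuous action. The range `[-2, 2]`, the 5 grid points, and the `q_values` below are made-up numbers for illustration.

```python
import numpy as np

def make_action_grid(low, high, n_bins):
    """Evenly spaced candidate actions covering the continuous range [low, high]."""
    return np.linspace(low, high, n_bins)

# Hypothetical 1-D continuous action in [-2, 2], reduced to 5 discrete choices.
action_grid = make_action_grid(-2.0, 2.0, 5)

# Stand-in for the network output: one action value per grid point.
q_values = np.array([0.2, 0.1, 0.9, 0.4, -0.3])

action_index = int(np.argmax(q_values))               # discrete choice, as in DQN
continuous_action = float(action_grid[action_index])  # value sent to the environment
```

The catch is that the number of grid points grows very quickly with the number of action dimensions, which is one reason to look at the next category of methods.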

## 2.2. Policy Gradient Methods

I believe this category of methods is more powerful in dealing with continuous spaces, particularly when we have continuous action-spaces. In this category of methods, we instead learn a **parameterized policy** that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but is not required for action selection.

Now, this category of methods offers some amazing advantages for continuous action-spaces. Instead of computing learned probabilities for each of the many actions, we instead learn statistics of the probability distribution. For example, the action set might be the real numbers, with actions chosen from a normal (*Gaussian*) distribution. You can read more about Policy Gradient methods in Ref 1 (*Chapter 13*).
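Here is a small sketch of such a Gaussian policy, where the mean and the standard deviation are computed from the state. I am assuming a linear mean and an exponentiated linear standard deviation, which is the parameterization I remember from Ref 1 (*Chapter 13*); the function names and the numbers are mine.

```python
import numpy as np

def gaussian_policy_sample(state_features, theta_mu, theta_log_sigma, rng):
    """Sample a real-valued action from a normal distribution defined by the state."""
    mu = float(theta_mu @ state_features)
    # Exponentiating keeps the standard deviation strictly positive.
    sigma = float(np.exp(theta_log_sigma @ state_features))
    action = float(rng.normal(mu, sigma))
    return action, mu, sigma

def log_prob_gradients(action, mu, sigma, state_features):
    """Gradients of log pi(action | state) w.r.t. theta_mu and theta_log_sigma."""
    grad_mu = (action - mu) / sigma**2 * state_features
    grad_log_sigma = ((action - mu) ** 2 / sigma**2 - 1.0) * state_features
    return grad_mu, grad_log_sigma

rng = np.random.default_rng(0)
x = np.array([1.0, 1.0])             # hypothetical state features
theta_mu = np.array([0.5, -0.5])     # gives mu = 0 for this state
theta_log_sigma = np.zeros(2)        # gives sigma = 1 for this state
action, mu, sigma = gaussian_policy_sample(x, theta_mu, theta_log_sigma, rng)
grad_mu, grad_log_sigma = log_prob_gradients(2.0, mu, sigma, x)
```

A policy gradient method such as REINFORCE would scale these gradients by the observed return and add them to the parameters, so actions that led to high returns become more likely.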

# 3. Optimal Behaviour

The short version:

- During training, we don’t know the optimal behaviour till the terminal states.
- Once the agent is trained, based on how accurate our trained agent is, we can estimate the optimal behaviour till the terminal states, with that accuracy.

However, a more interesting question here would be, “**During training, how far into the future can we look when computing the estimates?**”. Interestingly, there is another whole category of methods, known as **n-step methods**, which deals with this. Monte Carlo methods are often described as infinite-step methods, since they wait until the end of the episode, and their estimates come with minimal bias and maximal variance. Methods like SARSA, Q-Learning, Expected SARSA, etc. are 1-step methods, and their estimates come with maximal bias and minimal variance. The **n-step methods** sit in between, and let us trade off bias against variance by choosing n. You can combine n-step methods with DQN, which gives a variation of DQN popularly known as **N-step DQN**. You can read more about DQN and its variations in Ref 2 (*Chapter 5*), and you can read more about n-step methods in Ref 1 (*Chapter 7*).
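The n-step target itself is easy to compute: we sum the first n discounted rewards, and then bootstrap from a value estimate of the state reached after n steps. A minimal sketch, where the function name and the numbers are my own assumptions:

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of the given rewards plus the discounted bootstrap estimate."""
    g = bootstrap_value
    # Work backwards so each reward is discounted by the right power of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 3-step target: three observed rewards, then a value estimate of 10 for the state reached.
three_step = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)

# 1-step target, as used by SARSA and Q-Learning.
one_step = n_step_return([1.0], bootstrap_value=10.0, gamma=0.5)

# Monte Carlo return: all rewards until the end of the episode, nothing to bootstrap from.
monte_carlo = n_step_return([1.0, 1.0, 1.0, 1.0], bootstrap_value=0.0, gamma=0.5)
```

For N-step DQN, `bootstrap_value` would be the maximum target-network action value in the state reached after n steps.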

# 4. Conclusion

In this thread, I have extensively referred to the 2 sources which I mentioned towards the beginning of this post, since I have learnt from these 2 sources only.

To keep this thread as concise as possible, I have skipped a lot of details as well, since it’s impossible for me to summarize them all, for 2 reasons: first, this post would become infinitely long; and second, even I am a beginner in Reinforcement Learning as of now.

If you find some of the concepts mentioned in this thread confusing, please feel free to skip over them. The only thing you need to keep in mind is that **Reinforcement Learning** has a very well-built foundational structure, and on each of its aspects, there is considerable research going on right now. You are more than welcome to explore the intricacies of RL.

Cheers,

Elemento