Confusion on Target Variable Deep Reinforcement

According to the DQN algorithm, we repeatedly create an artificial training set to which we apply supervised learning, where the input is x = (s, a) and the target is constructed using the Bellman equation, with the notion that the Q function will eventually improve. But here the target itself is an approximation, not a real-valued y, so how does learning even take place when the target is constructed from a randomly initialized Q function?


Hi @Yash_Singhal, but the reward is real. This is the most important piece of information we give to the learning process.



I’m still confused about the relationship between the Target Network and the Q-Network.

For example, in C3_W3_A1_Assignment, what should I expect to see if I call utils.create_video() with the Target Network (i.e. target_q_network) instead of the Q-Network (i.e. q_network)?


What did you see when you tried that? What is your current understanding or expectation of the part that is confusing you? The assignment explained the purpose of having the target Q network; what do you think about that purpose?

Interesting. I was confused by this point as well. So you are saying that even though max Q(s’, a’) is random, the target (y) still has the reward component, i.e. y = R + γ max Q(s’, a’), and thus should lead to accurate Q values over time across many samples via gradient descent. Am I understanding correctly?


I would say it doesn’t matter whether max Q is random or not; we still have the reward, which is always independent of our model, be it a very well trained or a randomly initialized one.

A poor max Q may pick a poor action, but after carrying out the action we receive a reward, which is the source of knowledge about the environment the robot is interacting with.

We train the NNs on this piece of true knowledge plus some inaccurate estimates, and hope the NNs converge to something useful, if not close to the truth.
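To make this concrete, here is a tiny tabular sketch (a table instead of a neural network, so not the assignment's setup). The three-state chain environment, the learning rate of 0.5 and the discount of 0.9 are all made up for illustration. The Q table starts out random, yet the real reward in each target gradually pulls the estimates toward the true values.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.5  # assumed discount factor and learning rate

# Hypothetical toy environment: states 0 and 1 are non-terminal, state 2 is terminal.
# Action 1 moves one state to the right, action 0 stays put.
# The only reward (+1) is given on reaching the terminal state.
def step(state, action):
    next_state = state + 1 if action == 1 else state
    done = next_state == 2
    reward = 1.0 if done else 0.0
    return next_state, reward, done

Q = rng.normal(size=(2, 2))  # randomly initialized Q(s, a) for states 0 and 1

for _ in range(500):
    for s in (0, 1):
        for a in (0, 1):
            s_next, r, done = step(s, a)
            # The target mixes the real reward with the current (imperfect) estimate.
            y = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (y - Q[s, a])

print(np.round(Q, 3))
```

In this toy case the true values are Q(1, right) = 1, Q(1, stay) = Q(0, right) = 0.9 and Q(0, stay) = 0.81, and the table settles there whatever the random start was. A neural network on a real environment gives no such guarantee.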

I believe you have tried the assignment and seen for yourself that this idea can work.

However, we should be careful not to take things for granted: not every NN will work for every environment and every robot under every way of training. NNs are not magic.

PS: @samuel_varghese, sorry, I mistakenly edited your post. I am in transit and was not aware that I had clicked the edit button, but I should have edited it back.


@rmwkwok I read the explanation for Target Network a few times but I could not quite grasp it. I hoped by running utils.create_video() with the Target Network I would see something that would help me understand the relationship between the Target Network and the Q-Network. For example, I hoped to see that one was more accurate than the other because one had been trained on more data than the other. But I did not see a difference, perhaps because one is only slightly more accurate. I feel like I’m not understanding the main purpose of the Target Network.


I see. I think it will help if I quote some of the assignment’s provided code in my explanation. I will write the reply after around 3 hours.

I am actually in transit, but I am here because I had to wrap up a conversation with another learner quickly before I set off, and I wanted to check whether that learner had any follow-up questions.

Sorry, please come back later.


Hi Michael @mosofsky

First, the purpose of the target network, as quoted above, is to avoid instabilities. We can think about it this way: this time, we train our Q network with a sample set A covering certain states and actions. Such learning does not affect just those states and actions but all of them, because every weight in the NN gets updated. Next time, we train the NN with a sample set B covering different states and actions and, for the same reason, learning about the states and actions in B can completely change the NN’s original behavior on the states and actions of A. If the change is too dramatic, that is a bad thing - this is the kind of instability we want to avoid.

To achieve that, we need to introduce something that is reluctant to change into our system design, which is the target Q-Network. The Q-Network learns, and the Target Q-Network also learns, only that the latter learns in an indirect way:

[Screenshot from 2022-08-17 10-56-33]

It updates its weights by combining its own weights with the updated weights of the Q-Network. The ratio of this combination is controlled by a parameter \tau, which is usually very small, so that each time it keeps most of itself and takes only a tiny bit from the Q-Network. This makes the target Q-Network reluctant to change.

With this reluctance, the learning can become more stable and better preserve what’s been learnt.
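The soft update described above can be sketched in a few lines of NumPy. This is only an illustration, not the assignment's code: the function name soft_update, the list-of-arrays weight format and the value tau=0.001 are assumptions made for the example.

```python
import numpy as np

def soft_update(target_weights, q_weights, tau):
    """Return new target weights: tau * q + (1 - tau) * target, layer by layer."""
    return [tau * q + (1.0 - tau) * t for t, q in zip(target_weights, q_weights)]

# Hypothetical one-layer "networks", for illustration only.
q_weights = [np.array([1.0, 2.0])]
target_weights = [np.array([0.0, 0.0])]

slow = soft_update(target_weights, q_weights, tau=0.001)  # keeps 99.9% of itself
copy = soft_update(target_weights, q_weights, tau=1.0)    # becomes the Q-Network
print(slow[0], copy[0])
```

With a small tau the target weights barely move after one update, while tau = 1 simply copies the Q-Network's weights.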

To see for yourself how useful the target Q-Network is, you can experiment with different values of \tau. If you set it to 1, the target Q-Network is effectively equal to the Q-Network; in other words, we are effectively abandoning a separate Target Q-Network. That lets you see how the learning works out without a reluctant-to-change Target Q-Network. To do this, open the assignment and, without running any code, click “File” > “Open” and open “”. Check out the update_target_network function, where you will see the update formula, and adjust the \tau value to something very different (between 0 and 1, inclusive). Save your change, then go back to the notebook and run the code. Each time you change the \tau value, you need to restart the notebook kernel and run the code from the top of the notebook.

I also strongly suggest you try this out yourself: besides comparing the difference you mentioned, also see what happens at different values of \tau. Forming an expectation first, seeing the effect, and updating your understanding when needed is a very good way to learn.
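As a rough guide for picking values to try: if the Q-Network's weights stayed fixed (a simplifying assumption, since in practice they keep changing), each soft update would shrink the remaining gap between the two networks by a factor of (1 - \tau). The helper name remaining_gap below is made up for this sketch.

```python
def remaining_gap(num_updates, tau):
    """Fraction of the initial weight gap left after num_updates soft updates
    toward a fixed Q-Network (simplifying assumption: the Q-Network does not move)."""
    return (1.0 - tau) ** num_updates

for tau in (0.001, 0.01, 0.1, 1.0):
    print(f"tau={tau}: gap left after 1000 updates = {remaining_gap(1000, tau):.4f}")
```

With \tau = 1 the gap vanishes after a single update, which is the "no separate target network" case. With \tau = 0.001, roughly a third of the gap is still there after 1000 updates.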

If you do try, I look forward to your sharing :wink:



Thank you @rmwkwok , I will run some experiments with different values of τ.

Can you clarify one thing in the following sentence from C3_W3_A1_Assignment:

It seems like it’s referring to 3 networks:

  1. 𝑄̂ -Network (i.e. “Q hat Network”)
  2. target 𝑄̂ -Network
  3. 𝑄 -Network (i.e. “Q Network”)

I know there are only supposed to be 2 networks so I’m wondering if the first one is a typo? In other words, should the sentence say “every 𝐶 time steps we will use the 𝑄 -Network [i.e. “Q Network”] to generate the 𝑦 targets” instead of “every 𝐶 time steps we will use the 𝑄̂ -Network [i.e. “Q hat Network”] to generate the 𝑦 targets”? If not, then is the 𝑄̂ -Network short-hand notation for the “target 𝑄̂ -Network” OR is it shorthand for the “𝑄 -Network”?

Hi Michael, I think the assignment did define Q-hat as the target Q, so they should refer to the same thing.


Thank you @rmwkwok , I understand much better now after re-watching the Reinforcement Learning lectures, re-reading your comments, trying TAU=1, and reading this PyTorch explanation of DQN.

Based on my current understanding, I think the following is the answer to my question about the relationship between the Q-Network and the Target Q-Network.

Here is a graph of what I would expect to see if we recorded average scores for Q-Network and Target Q-Network as i approaches num_episodes:

Here is a narrative description of the graph above:

In the end, the Q-Network and the Target Q-Network perform about the same because they have been influencing each other’s development through the algorithm. They both start with random weights. Upon the first update event, Q-Network receives training and then the Soft Update mechanism gently nudges the Target Q-Network weights towards the Q-Network’s weights. After that first training, Q-Network probably performs much better than Target Q-Network for quite some time. But at some point, Target Q-Network starts to improve. Target Q-Network never receives direct training; it is merely updated via Soft Update. But Target Q-Network influences Q-Network’s development since Target Q-Network’s values stand in for the correct answer in the loss function. Thus Q-Network influences Target Q-Network’s weights through Soft Update and Target Q-Network influences Q-Network’s weights during neural network training through the loss function. As we reach the end of the process (i.e. as i approaches num_episodes), Q-Network and Target Q-Network perform similarly. At this point, Target Q-Network is not changing much; it reaches a plateau. Target Q-Network’s accuracy is pretty stable through subsequent trainings whereas Q-Network may occasionally outperform Target Q-Network and occasionally underperform Target Q-Network, episode to episode.

In conclusion, the evolution of Q-Network and Target Q-Network reminds me of the answer to the classic paradox, “which came first, the chicken or the egg”. In short, they developed simultaneously (Chicken or the egg - Wikipedia).

I may not have this all right, but does my understanding seem better now? Thanks in advance for any feedback.


Hey Michael @mosofsky! I think this is going to be one of my favourite posts. I am going to bookmark this.

Your plot shows a few interesting features, and your description covers many of them, so let me just quote your words:

Your previous description supports this outcome, and leads me to this analogy: the Q network wants to learn aggressively but, at some point, has to calm down when no surprising, unseen types of samples come in; the target Q network keeps dragging the Q network from behind.

In other words, when the learning starts, all samples are new, so the Q network goes crazy learning them; the target Q network tries to slow it down, but the Q network can still be too hyperactive about the new stuff. Then, as time goes on, new stuff becomes rare, the target Q network’s dragging force becomes more dominant, and the Q network starts to settle down.

Yes, the Q-network isn’t going to be completely at rest.

This is interesting. I think the egg is like the Q-network, which brings us major evolutionary change; the chicken is like the target Q network, which teaches/guides the chick about how the world works from the parent chicken’s perspective/experience. The reward is like the influence from the world that the chick perceives.

Another interesting feature of your plot is the Q*-N, which I suppose refers to the true Q-network. I like that your plot keeps it far from the Q-N and the target Q-N. I guess in most if not all cases we can’t know what the Q*-N actually is; otherwise we would have had to search 100% of the state and action space and all possible paths that connect them, which is impossible because there can be an infinite number of paths when either the state or the action values are continuous, unbounded, or both.

Indeed, the infeasibility of this brute-force, 100% search is a motivation for using a NN to approximate a useful function. It is useful in that we can make good decisions from it.

Lastly, I think your description has given us a good understanding of how the Q-network and the target Q-network can be designed to work. Although sometimes they don’t behave as we expect, or we might need to tune the models and some relevant parameters like TAU to make them work better, I think we agree that this Q-network and target Q-network pair has very good potential to deliver something useful for us.

Thank you for sharing, Michael!



Thank you for the discussion, guys. It really helped put things in perspective and in order.

So, to answer the question I originally asked: it really doesn’t matter that the Q function is randomly initialized, because we have lots and lots of training examples, and each example includes the reward, which is the actual information responsible for learning. Plus, we could have avoided the target Q neural network entirely, but since the Q network itself is prone to change, that can lead to numerical instability. So we basically take the output from the target neural network and, after some episodes, do a soft update.

Hi @Yash_Singhal, the only thing I am not sure about in your response is this:

In the assignment, we use the output of the QN to make decisions, and we use the outputs of both the QN and the TQN to train the QN; right after the training, we update the TQN. Each episode has many time steps, and the QN is trained once every certain number of time steps. You may read the assignment again for details.

Above is how the assignment works. However, nobody can stop you from changing how things work in your own RL system :wink:
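As a rough sketch of the training step described above, where the QN is trained against targets computed from the TQN, the loss could look like the following. This is plain NumPy rather than the assignment's TensorFlow code; the name compute_loss, the toy linear "networks" and gamma=0.995 are assumptions for illustration.

```python
import numpy as np

def compute_loss(q_net, target_q_net, states, actions, rewards, next_states, dones, gamma=0.995):
    """Mean squared error between the y targets (from the TQN) and Q(s, a) (from the QN)."""
    # y = R if the episode ended, otherwise R + gamma * max_a' Q_hat(s', a')
    max_next_q = target_q_net(next_states).max(axis=1)
    y_targets = rewards + gamma * max_next_q * (1.0 - dones)
    # Q(s, a) for the actions that were actually taken
    q_taken = q_net(states)[np.arange(len(actions)), actions]
    return float(np.mean((y_targets - q_taken) ** 2))

# Hypothetical toy "networks": linear maps from a 2-element state to 2 action values.
q_net = lambda s: s @ np.eye(2)
target_q_net = lambda s: s @ np.zeros((2, 2))

states = np.array([[1.0, 0.0], [0.0, 1.0]])
next_states = np.array([[0.0, 1.0], [1.0, 0.0]])
actions = np.array([0, 1])
rewards = np.array([1.0, 3.0])
dones = np.array([0.0, 1.0])

loss = compute_loss(q_net, target_q_net, states, actions, rewards, next_states, dones)
print(loss)
```

Only the QN would receive gradients from this loss; the TQN just supplies the targets and is then nudged by the soft update.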


Yeah you are correct …


Hi @rmwkwok , thank you so much for your feedback. Reinforcement Learning is one of the most interesting topics in the class but it was also the most challenging for me. Someday I would like to understand the Reinforcement Learning used for the robotic dog that learned to walk in 1 hour without a simulator (see DayDreamer).

Just to clarify, in the sentence below, did you mean to say “the Q network goes crazy learning”?

I suppose it would be possible to create an actual version of the graph I hand-drew by adding a bit of code to the Jupyter notebook. I wonder how close my drawing is. And I wonder if there’s a convergence point for Q-Network and Target Q-Network and whether that would be a better stopping point as opposed to av_latest_points >= 200.0.

Thank you. It was my mistake. Corrected it.

Then you need a measure for average score. Or instead of average score, we need something that can represent improvements. Can you think of a few candidates? Of course, I am not saying that the score isn’t a good candidate :wink: Just always good to have a couple more for reference.

I always think there is only one way to find out when it comes to questions like this. There is no general answer to this kind of thing.

Discussion without action is pretty boring, right? We won’t go very far by staying at thoughts.

Here is what I tried when I had a question of my own: how would the action decision space change over iterations? Knowing that nobody could tell me the answer, I added or changed about 10 lines of code or fewer in the assignment to:

  1. remove the stopping point and force it to run 2000 episodes
  2. set epsilon to 0.5 so it both explores and exploits
  3. record explicitly the Q-N’s best action at the beginning of each episode for a series of pre-defined states - I won’t go into details about the definition, but the angle (one of the elements of the state vector) ranges from -45 degrees to 45 degrees
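Step 3 could be sketched roughly as follows. This is not the code used for the experiment. The function name, the assumption that the angle sits at index 4 of an 8-element state vector and is measured in radians, and the toy Q function are all assumptions for illustration.

```python
import numpy as np

def best_actions_over_angles(q_function, angles_deg, angle_index=4, state_size=8):
    """Return the greedy action for a batch of states that differ only in the angle."""
    states = np.zeros((len(angles_deg), state_size), dtype=np.float32)
    states[:, angle_index] = np.deg2rad(angles_deg)  # assumed: angle stored in radians
    q_values = np.asarray(q_function(states))
    return np.argmax(q_values, axis=1)

# Hypothetical stand-in for a trained network: prefers "do nothing" (column 0)
# near zero tilt and one of two side actions (columns 1 and 3) when tilted.
def toy_q_function(states):
    angle = states[:, 4]
    return np.stack([0.1 - np.abs(angle), -angle, np.full(len(angle), -1.0), angle], axis=1)

angles = np.array([-45.0, 0.0, 45.0])
print(best_actions_over_angles(toy_q_function, angles))
```

Calling this once per episode with the real Q-N and storing the result would give one column of the kind of angle-versus-iteration picture described here.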

The idea above is to see how the robot reacts to a tilted posture. Below is the result: the x axis is the number of iterations and the y axis is the angle. Color represents actions. Purple means no action; Blue/Green/Yellow means firing the right/main/left engine.


A few points:

  1. From the plot, as it iterates, the robot learnt to balance better when tilted.
  2. It became more tolerant of small angles later, as it appears more willing to do nothing there (firing an engine receives a negative reward)
  3. Just in case, please really don’t generalize this result to anything more than the test itself. It’s extremely dangerous. So if you try to generalize my result and apply it to your question, I won’t comment on that.
  4. Your question requires your own series of experiments.


Thank you for the inspiration @rmwkwok .

Response: how about using the loss function for the y-axis, since that combines the return values of the Q-Network and the Target Q-Network? Is that the alternative you were thinking of?

Challenge accepted, @rmwkwok ! I modified the code to generate the hypothetical graph I had sketched and here it is:


Before generating this chart, I wrote down the following hypotheses:


  1. End performance of both networks will be similar but they’ll basically converge
  2. The convergence of Q-Network and Target Q-Network is a plateau
    2.1 The convergence would be a more efficient stopping point than av_latest_points >= 200.0
  3. Early in the episodes, Q-Network’s performance will be much higher
  4. Q-Network will vary much more wildly than Target Q-Network

I think the data support each hypothesis; I’ll go through them one-by-one:

1. End performance of both networks will be similar but they’ll basically converge

After 250 episodes, the Q-Network and Target Q-Network lines mostly overlap apart from a notable blip at 1750 episodes. Generally, the Q-Network’s average scores after the plateau wiggle around the Target Q-Network more, as expected (hypothesis 4).

2. The convergence of Q-Network and Target Q-Network is a plateau

Ultimately, the convergence became a plateau after 750 episodes but from 250 to 750 episodes, the two networks were improving together at about the same rate. So my hypothesis was a bit wrong in not predicting convergence prior to the plateau.

2.1 The convergence would be a more efficient stopping point than av_latest_points >= 200.0

Yes but it’s not enough to look for convergence. The convergence also has to be a plateau.

For the most part, terminating at a converged plateau does seem more efficient than stopping after av_latest_points >= 200.0 because the additional training resulting from waiting for a converged plateau produced a model that can achieve a better score than 200, i.e. around 275 points on average.

Also, the converged plateau criterion allows you to stop the episodes loop with confidence that additional training will not improve the model much.

Thus I think stopping when Q-Network and Target Q-Network converge at a plateau produces a model with high accuracy at the minimum training cost.
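One hypothetical way to code such a "converged plateau" check is sketched below. I did not use this in the experiment; the function name, the window size and both tolerances are assumptions chosen for illustration, and they would need tuning against real score histories.

```python
import numpy as np

def converged_plateau(q_scores, target_scores, window=100, gap_tol=10.0, change_tol=5.0):
    """True when both networks score about the same over the last `window` episodes
    (converged) and the Q-Network's average has stopped moving (plateau)."""
    if len(q_scores) < 2 * window or len(target_scores) < 2 * window:
        return False  # not enough history to compare two windows
    q = np.asarray(q_scores, dtype=float)
    t = np.asarray(target_scores, dtype=float)
    q_recent = q[-window:].mean()
    q_previous = q[-2 * window:-window].mean()
    t_recent = t[-window:].mean()
    converged = abs(q_recent - t_recent) <= gap_tol
    plateau = abs(q_recent - q_previous) <= change_tol
    return bool(converged and plateau)

flat = [270.0] * 200                      # both networks steady at the same score
rising = [float(x) for x in range(200)]   # both networks still improving together
print(converged_plateau(flat, flat), converged_plateau(rising, rising))
```

The second case is the situation from 250 to 750 episodes: the networks agree with each other but are still improving, so the check does not fire.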

3. Early in the episodes, Q-Network’s performance will be much higher

Indeed, during the first five or so episodes, the Q-Network shot up to an average score of 200 whereas the Target Q-Network took time to catch up.

4. Q-Network will vary much more wildly than Target Q-Network

Over the course of the 2000 episodes, the Q-Network’s performance varied quite a bit compared to Target Q-Network. The difference was most noticeable in the early episodes. Around 250 episodes, they started moving together but the Q-Network always wiggled more. As mentioned previously, Q-Network jumped suddenly and considerably away from Target Q-Network around 1750 episodes. This suggests that to approximate the optimal action-value function, Q*, even more training would be beneficial if the stakes are high enough. Perhaps this contradicts the recommendation in section 2.1 to stop the process when Q-Network and Target Q-Network converge at a plateau. So I think what I should say now is that the converged plateau is a more efficient criterion than av_latest_points >= 200.0 but not optimal (and finding optimal is probably hard or impossible).

If anyone needs to try to reproduce the graph or improve upon this analysis, below are the modifications I made to the code:

First, I made the following changes to the code in the section “9 - Train the Agent” in order to save the Q-Network and Target Q-Network after each of the 2000 episodes.

    # We will consider that the environment is solved if we get an
    # average of 200 points in the last 100 episodes.
    # TODO: compare_q_n_and_target_q_n uncomment the following lines after finishing the experiment
    # if av_latest_points >= 200.0:
    #     print(f"\n\nEnvironment solved in {i+1} episodes!")
    #     break
    # TODO: compare_q_n_and_target_q_n comment out the following lines after finishing the experiment
    q_network.save('lunar_lander_model_q_network_episode' + str(i) + '.h5')
    target_q_network.save('lunar_lander_model_target_q_network_episode' + str(i) + '.h5')

Here is the output of that code:

Episode 100 | Total point average of the last 100 episodes: -107.38
Episode 200 | Total point average of the last 100 episodes: -38.535
Episode 300 | Total point average of the last 100 episodes: -51.13
Episode 400 | Total point average of the last 100 episodes: 117.30
Episode 500 | Total point average of the last 100 episodes: 242.00
Episode 600 | Total point average of the last 100 episodes: 254.56
Episode 700 | Total point average of the last 100 episodes: 251.96
Episode 800 | Total point average of the last 100 episodes: 265.15
Episode 900 | Total point average of the last 100 episodes: 254.42
Episode 1000 | Total point average of the last 100 episodes: 254.49
Episode 1100 | Total point average of the last 100 episodes: 270.53
Episode 1200 | Total point average of the last 100 episodes: 263.61
Episode 1300 | Total point average of the last 100 episodes: 270.21
Episode 1400 | Total point average of the last 100 episodes: 267.25
Episode 1500 | Total point average of the last 100 episodes: 271.90
Episode 1600 | Total point average of the last 100 episodes: 271.81
Episode 1700 | Total point average of the last 100 episodes: 270.89
Episode 1800 | Total point average of the last 100 episodes: 271.13
Episode 1900 | Total point average of the last 100 episodes: 262.76
Episode 2000 | Total point average of the last 100 episodes: 276.02

Total Runtime: 1383.20 s (23.05 min)

Existing plot “Plot the point history” from section “9 - Train the Agent”:

Here is a video of the Lunar Lander: Lunar Lander Reinforcement Learning after 2000 episodes of training - YouTube

Finally, below is the code which generates the chart “Q-Network vs. Target Q-Network Avg. Points over Episodes”. This code needs to run after the code in section “9 - Train the Agent”.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

def plot_network_comparison(q_network_reward_history, target_q_network_reward_history, rolling_window):
    xs = [x for x in range(len(q_network_reward_history))]
    q_network_df = pd.DataFrame(q_network_reward_history)
    q_network_rollingMean = q_network_df.rolling(rolling_window).mean()

    target_q_network_df = pd.DataFrame(target_q_network_reward_history)
    target_q_network_rollingMean = target_q_network_df.rolling(rolling_window).mean()

    plt.figure(figsize=(10,7), facecolor='white')
    q_network_line = plt.plot(xs, q_network_rollingMean, linewidth=4, color='blue', label='Q-Network')
    target_q_network_line = plt.plot(xs, target_q_network_rollingMean, linewidth=2, color='orange', label='Target Q-Network')

    text_color = 'black'
    ax = plt.gca()
    plt.title("Q-Network vs. Target Q-Network Avg. Points over Episodes", color=text_color, fontsize=20)
    plt.xlabel('Episode', color=text_color, fontsize=15)
    plt.ylabel('Total Points', color=text_color, fontsize=15)
    yNumFmt = mticker.StrMethodFormatter('{x:,}')
    ax.tick_params(axis='x', colors=text_color)
    ax.tick_params(axis='y', colors=text_color)
    plt.legend(loc="lower right")

import gym
import numpy as np

# Relies on max_num_timesteps defined in section "9 - Train the Agent".
def run_simulation(network):
    total_points = 0
    env = gym.make('LunarLander-v2')
    state = env.reset()
    for t in range(max_num_timesteps):

        # From the current state S choose the greedy action A (no exploration)
        state_qn = np.expand_dims(state, axis=0)  # state needs to be the right shape for the network
        q_values = network(state_qn)
        action = np.argmax(q_values.numpy()[0])
        # Take action A and receive reward R and the next state S'
        next_state, reward, done, _ = env.step(action)

        state = next_state.copy()
        total_points += reward

        if done:
            break
    return total_points

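from tensorflow import keras

# Relies on num_episodes defined in section "9 - Train the Agent".
q_network_points = []
target_q_network_points = []
for i in range(num_episodes):
    restored_q_network = keras.models.load_model('lunar_lander_model_q_network_episode' + str(i) + '.h5', compile=False)
    restored_target_q_network = keras.models.load_model('lunar_lander_model_target_q_network_episode' + str(i) + '.h5', compile=False)

    # Score each restored network with one simulated episode
    q_network_points.append(run_simulation(restored_q_network))
    target_q_network_points.append(run_simulation(restored_target_q_network))

    if (i+1) % 100 == 0:
        print(f"\rEpisode {i+1} | Total q_network_points {len(q_network_points)}")

plot_network_comparison(q_network_points, target_q_network_points, 10)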