Confusion on Target Variable in Deep Reinforcement Learning

Hey Michael @mosofsky ! Thank you for sharing your work!

Comments for the first Total Points plot:

  1. TQN lags behind and follows QN
  2. QN can be unstable even after it has converged. Thank you, TQN.

Questions…

  1. Why are the 1st and the 2nd Total Points plots different in the early episodes? The 2nd plot looks more reasonable to me, as the robot should hardly get good points at the beginning.
  2. In the 2nd plot, the light blue and the pink lines are for QN and the moving average of QN, respectively?

What I'd like to share:

  1. You plotted the total points rather than the loss, so I looked at the loss, and a couple of other metrics, but only for the QN, because we already know that TQN follows QN and is more stable than QN.
  2. Instead of saving a trained NN after each episode, my readings are collected over the course of training within each episode and then aggregated. Therefore, the readings are NOT always from using the best action, because the learning epsilon is not zero.
How to read the plots:
  1. The grey area means the robot crashed in those episodes
  2. The blue line is a 50-episode moving average
  3. The x-axis is episodes, from 0 to 1999
Case 1: Learning epsilon = 1.0, fully exploring

Six plots, arranged in two columns:

  Column 1: Total Displacement, Total Points, Total Loss
  Column 2: Number of Timesteps, Reward per timestep, Loss per update

Observations:
  1. The number of timesteps and total displacement are small at first because the robot crashes early, then increase as it learns to avoid those early crashes, and finally decrease as its actions become more optimal
  2. Two significant changes occurred, at around episodes 500 and 1200. The first is a sharp improvement, whereas the second is a short-term degradation that the agent recovers from
  3. The reward per timestep is increasing, so the agent is getting more efficient.
Case 2: learning epsilon = 0.5, half exploring

Six plots, arranged in two columns:

  Column 1: Total Displacement, Total Points, Total Loss
  Column 2: Number of Timesteps, Reward per timestep, Loss per update

Observations:

  1. The total loss appears to oscillate at first; this should be due to the changing number of timesteps, which is why the same oscillation does not appear in the loss-per-update plot
  2. No short-term degradation is observed, unlike in case 1.
  3. The crash rate improves more slowly than in case 1.

Cheers,
Raymond

Code changes:

def agent_learn(experiences, gamma):
    ...
    return loss #Added: return the TD loss so it can be logged per update

# Section 9
ep_buffer = [] #Added: one entry of aggregated readings per episode
for i in range(num_episodes):
    ts_buffer = [] #Added: per-timestep readings for this episode
    ...
        if update:
            ...
            loss = agent_learn(experiences, GAMMA) #Changed: keep the returned loss
        else: #Added
            loss = 0. #Added: no update this timestep, so nothing to record
        # Added: per-timestep displacement, reward, loss, and update flag
        ts_buffer.append([np.sqrt(((state[:2] - next_state[:2])**2).sum()), reward, loss, float(update)]) #Added
    ...
#     if av_latest_points >= 200.0: #Changed: early stopping disabled so every episode runs
#         print(f"\n\nEnvironment solved in {i+1} episodes!") #Changed
#         q_network.save('lunar_lander_model.h5') #Changed
#         break #Changed
    # Added: timesteps, crash flag, total points, plus the summed per-timestep readings
    # (total displacement, total reward, total loss, number of updates)
    ep_buffer.append((t, env.game_over or abs(state[0]) >= 1.0, total_points, *np.array(ts_buffer).sum(axis=0))) #Added

# Plots (numpy is already imported by the notebook)
import numpy as np
from matplotlib import pyplot as plt

t, crashed, total_points, distance, reward, loss, update = np.array(ep_buffer).T
lines = {
    'number of timesteps': t,
    'total points': total_points,
    'total displacement': distance,
    'total loss': loss,
    'reward per timestep': reward/t,
    'loss per update': loss/update
}

fig, axes = plt.subplots(3, 2, figsize=(16, 16), sharex=True)
for (name, line), ax in zip(lines.items(), axes.flatten()):
    # 50-episode moving average computed from a cumulative sum
    ma = np.cumsum(line)
    ma[50:] = ma[50:] - ma[:-50]
    ma = ma[50-1:] / 50
    # Grey shading marks episodes that ended in a crash
    ax.twinx().fill_between(np.arange(len(line)), crashed, 0, color='grey', alpha=.2)
    ax.plot(line, lw=0.5, color='red')  # raw per-episode value
    ax.plot(ma, color='blue')           # 50-episode moving average
    ax.set_ylabel(name)

Thank you for the further insights, @rmwkwok. Let me reply to your questions:

I’ll answer them bottom to top:

In the 2nd plot, the light blue and the pink lines are for QN and the moving average of QN, respectively?

Yes, correct. The 2nd plot is not one I designed; it’s from the original Jupyter notebook, in the section “9 - Train the Agent”, where the code says “Plot the point history”.
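For reference, it does roughly this (my own paraphrase of that cell, not its exact code; I'm assuming the training loop stores each episode's score in a total_point_history list, and the colors here just mirror the plot's light blue / pink lines):

import numpy as np
from matplotlib import pyplot as plt

def plot_point_history(total_point_history, window=100):
    points = np.asarray(total_point_history, dtype=float)
    # Moving average over the last `window` episodes (shorter at the start)
    moving_avg = [points[max(0, i - window + 1):i + 1].mean()
                  for i in range(len(points))]
    plt.plot(points, color='lightblue', label='total points per episode')
    plt.plot(moving_avg, color='deeppink', label=f'moving average over {window} episodes')
    plt.xlabel('episode')
    plt.ylabel('total points')
    plt.legend()
    plt.show()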

Why are the 1st and the 2nd Total Points plots different in the early episodes? The 2nd plot looks more reasonable to me, as the robot should hardly get good points at the beginning.

I don’t know. Maybe it’s random? It also doesn’t make sense to me that the Q-Network would perform well in the early episodes. It does make sense that the Q-Network performs much better than the Target Q-Network early on. But why did the Q-Network achieve scores in the 250 range early on? That is curious indeed. I guess I could rerun a few times to see how reproducible this is (but I’m getting off a train right now and I don’t know if Jupyter notebooks pause when I put my computer to sleep).

I see. At least in my results, the QN does not score anything positive in the early episodes. RL is interesting, isn’t it?

@rmwkwok I ran my evaluation code two more times, and in those runs the Q-Network does not score anything positive in the early episodes. So I guess that was just randomness in the first chart I generated.

[Total Points plots from the two additional runs]

It’s great! Mystery resolved: convergence is still there, and instabilities after convergence are still possible. How is the DayDreamer robotic dog stuff going? I see a paper in the link you provided; are you going to read it?

Yes, I read the DayDreamer paper, but I think I need to learn the following things to understand it:

  1. Actor/Critic models
  2. Recurrent Neural Network

So far, I’ve tried understanding it by imagining how I would adapt the Lunar Lander assignment to use the DayDreamer approach. I think that would entail something like this:

Pretend the Lunar Lander simulator is a real lunar lander that only has one chance to land safely.
Thus we have to reduce training to one episode and attempt to learn a good model within the inner for-loop.

Introduce a new thread for the training to be done in parallel while the Lunar Lander is falling toward the moon.

In the learning thread, we need to train two models: a World Model and an Actor-Critic.

To train the World Model, I think we need a neural network that takes the state and action as input and outputs the next state and reward. I’m not sure what it’s used for, because it seems similar to the Actor model.
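To make my mental picture concrete, this is roughly the kind of network I'm imagining (just my own sketch, not DayDreamer's actual architecture, and assuming the Lunar Lander's 8-dimensional state and 4 discrete actions):

import tensorflow as tf
from tensorflow.keras import layers

STATE_SIZE = 8    # Lunar Lander observation size
NUM_ACTIONS = 4   # do nothing, left engine, main engine, right engine

state_in = layers.Input(shape=(STATE_SIZE,))
action_in = layers.Input(shape=(NUM_ACTIONS,))   # one-hot encoded action
x = layers.Concatenate()([state_in, action_in])
x = layers.Dense(64, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
next_state_out = layers.Dense(STATE_SIZE, name='next_state')(x)
reward_out = layers.Dense(1, name='reward')(x)

# World model: (state, action) -> (predicted next state, predicted reward)
world_model = tf.keras.Model([state_in, action_in], [next_state_out, reward_out])
world_model.compile(optimizer='adam', loss='mse')
# It would be trained on (state, action, next state, reward) tuples from the replay buffer.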

In the Actor Critic model, the Actor seems similar to the one we built in the assignment. I think it takes as input the state and action and predicts the next state and the return. I could not understand the Critic’s input and output from the paper but conceptually I think its role is to give the Actor model feedback somehow.

My understanding of DayDreamer is just at the beginning. If you can offer any links to help me learn the background knowledge, I would appreciate that. I’ve certainly tried googling that stuff myself but the material I’ve come across is either too high-level or assumes too much prior knowledge. Andrew Ng’s lectures were just right for me. I wish he had a lecture on all these concepts. I’m considering taking the Deep Learning Specialization because I think it covers some of the topics such as Recurrent Neural Networks.

Hi Michael @mosofsky ,

I wanted to focus on this, so I had to wrap up another project of mine that was close to its deadline; thank you for your patience. Below is my understanding, and I hope it helps you get some idea:

The real world replaces the simulator; a model of the real world replaces the physics we define in the simulator.

In the real world, if we jump, we move upward because that is how the world works, and as a result we end up at a new height A. In a simulator, we implement physics rules such as Newtonian mechanics so that, given a state and an action (which implies an acceleration), the simulator can compute our next state, which has a greater height B than the original state (which implies moving upward).

We can think of the world model as capturing how the world works, which means that, given the same state and the same action (as we give to the simulator in the above example), a perfect world model will be able to compute the same next state (or the same next height) as height A (not height B).

Training on a model of the real world is supposed to be better than training on a simulator, because the latter is too theoretical and always requires a lot of assumptions that can’t be met in the real world, whereas the former does its best to capture all the physics in very good detail. An excellent real-world model can completely replace any physics rules we implement in a simulator. And this is actually what DayDreamer does - it uses the World Model’s output to train the actor-critic algorithm (if we had used the simulator approach, we would have used the simulator’s output to train the actor-critic algorithm).

So the world model is something completely new that does not exist in our Lunar Lander assignment, which (not verified) I think uses a set of pre-defined physics.
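To make the contrast concrete, here is a toy sketch (hypothetical interfaces of my own, not code from the assignment or from the paper) of where the two training pipelines differ:

def simulated_step(env, action):
    # Simulator: the next state comes from hand-written physics rules
    next_state, reward, done, _ = env.step(action)
    return next_state, reward, done

def dreamed_step(world_model, state, action):
    # World model: the next state comes from a learned model of the real
    # world, fitted to transitions the robot actually experienced
    next_state, reward = world_model.predict(state, action)
    return next_state, reward

# Either way, the resulting (state, action, reward, next_state) tuples are
# what the actor-critic algorithm trains on; only their source changes.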



For the actor-critic algorithm, I suggest you go through this notebook. You can download and run it on your computer. It also provides a pretty nice and simple explanation of the concepts; we can talk about your understanding of them if you want to. This example should be a good starting point - we will still be some steps away from the DayDreamer version, but it is a good start, I think.
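In case it helps before you open the notebook, the core object there is just one network with two heads. A rough sketch (my own simplification, not the notebook's code):

import tensorflow as tf
from tensorflow.keras import layers

class ActorCritic(tf.keras.Model):
    def __init__(self, num_actions, hidden_units=128):
        super().__init__()
        self.shared = layers.Dense(hidden_units, activation='relu')
        self.actor_logits = layers.Dense(num_actions)  # the policy: a distribution over actions
        self.critic_value = layers.Dense(1)             # an estimate of the expected return

    def call(self, state):
        x = self.shared(state)
        return self.actor_logits(x), self.critic_value(x)

# The actor head is trained to make actions that turned out better than the
# critic expected more likely; the critic head is trained so its value
# estimate matches the observed (discounted) return. Both learn from the
# same experienced trajectories, just with different loss functions.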

Also, I am wondering whether you find this kind of notes helpful for your understanding. You know, there is a maths part and a non-maths part. Both are crucial, but I am not sure whether both of them are interesting to you. It can be uninteresting, but it is useful.



Lastly, I think you would agree that the RNN itself isn’t the core of reinforcement learning. It matters here, of course, because you are interested in how DayDreamer works. To learn the real world, DayDreamer uses an RNN (analogous to the transition matrix in the language of Kalman filtering, which the paper also mentions as what the authors’ world-model approach follows), and because the authors need to reuse the weights inside the RNN an arbitrary number of times during training, an RNN is the natural choice for this task. My description here isn’t a complete explanation of what an RNN is, just a simple idea of why it is used. RNNs are worth a course of their own.
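Just to illustrate the "reuse the same weights an arbitrary number of times" point, here is a toy sketch of a recurrent world model rolling forward through a sequence of actions (nothing like the paper's actual architecture; the sizes assume the Lunar Lander again):

import tensorflow as tf
from tensorflow.keras import layers

STATE_SIZE = 8    # assuming the Lunar Lander observation size
NUM_ACTIONS = 4
HIDDEN = 64

gru = layers.GRU(HIDDEN, return_state=True)  # one set of recurrent weights
decoder = layers.Dense(STATE_SIZE)           # predicts the next observation

def imagine(initial_obs, actions):
    # Roll the learned model forward through a sequence of actions,
    # applying the same GRU weights at every step
    h = tf.zeros((1, HIDDEN))                # recurrent hidden state
    obs = tf.reshape(tf.cast(initial_obs, tf.float32), (1, STATE_SIZE))
    predicted = []
    for a in actions:
        a_onehot = tf.one_hot([a], NUM_ACTIONS)
        step_in = tf.concat([obs, a_onehot], axis=-1)[:, None, :]  # (batch, time=1, features)
        _, h = gru(step_in, initial_state=h)
        obs = decoder(h)
        predicted.append(obs)
    return predicted   # imagined observations, one per action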

Should we focus on the actor-critic part first?

Raymond

PS: Having said how much better a real-world model is than a set of simulator physics rules, the simulator is still very important because it is cheaper. I don’t think they would train the DayDreamer dog to jump across a deep river, because it takes many failures to learn the right way. Also, the physics that the real-world model learns is limited to the environments we provide to the robot, and sometimes we just can’t offer all possible and practical environments to the robot. The DayDreamer dog in the video at the link you provided only demonstrates that it works on a flat floor. There are limitations either way.

Thank you @rmwkwok , I really appreciate all the insights, explanations, and links to other resources! At this point, I think the best way for me to learn is to take the Deep Learning Specialization.

I tried reading the materials you sent but I quickly got hung up on basic questions. For example, in the Actor-Critic model, how does the training data for the Actor differ from that of the Critic? I assume they both learn from the replay buffer. Is the Actor supposed to start off proposing random actions? And the Critic has been trained with some minimal amount of training data from the replay buffer to evaluate the Actor’s initial action? And then is the Actor supposed to learn from the Critic’s evaluations of its actions? If that is correct, then it doesn’t make sense to me why the Actor wouldn’t just be trained on some of the replay buffer’s data. But then it’s trained the same as the Critic so they’d probably just make the same recommendations. See, I think I would just be better off taking Andrew Ng’s next course so I can learn the basics most efficiently before trying to delve into DayDreamer.

From what little I did learn about DayDreamer, I can see it’s not the panacea I had originally believed. It could not effectively solve the lunar lander the way I described because it wouldn’t go through a full learning cycle without the opportunity to crash the lunar lander a bunch of times. Now I think DayDreamer is probably best suited for tasks that can be performed from start to finish over and over with minimal negative consequences. Learning to walk has low consequences, i.e. just falling down. That’s nothing compared to crashing a lunar lander. I think I was a bit too dazzled and mesmerized by the video of the dogbot learning to walk in an hour! It made me think DayDreamer could do more than it can.

Lastly, I’d like to respond to your point about simulators vs. the real world. I think it might be impossible to build perfect models because of what I’ve learned about Chaos Theory. According to my understanding, Chaos Theory says perfect models would require an infinite level of precision. Infinity is not attainable by definition, I guess. The models we have of the world are only useful within limited circumstances. And I think Chaos Theory says that models are guaranteed to be useful only in limited circumstances. Sure, we can make them more precise, maybe even sufficiently precise for the given application. But there will always be more room for improvement because the goalpost is infinitely far away. So, it’s good to train in the real world but a model is helpful for bootstrapping.

I just wanted to write back to thank you, but at this point, I think I’ve got to duck out of this conversation because I need to learn more of the basics in the Deep Learning Specialization.

Hey @mosofsky, you are welcome! Enjoy the DLS :wink:

Raymond