Hey Michael @mosofsky! Thank you for sharing your work!
Comments for the first Total Points plot:
- TQN lags behind and follows QN
- QN can be unstable even after it has converged. Thank you, TQN.
Questions…
- Why are the 1st and the 2nd Total Points plots different in the early episodes? The 2nd plot looks more reasonable to me, as the robot should hardly score good points at the beginning.
- In the 2nd plot, are the light blue and the pink lines QN and the moving average of QN, respectively?
Sharing my own experiments:
- You plotted the total points rather than the loss, so I tracked the loss (and a couple of other metrics), but only for QN, since we already know that TQN follows QN and is more stable than QN.
- Instead of saving a trained NN after each episode, my readings are collected over the course of training within each episode and then aggregated. Therefore, the readings are NOT always from the best action, because the learning epsilons are not zero (see the sketch right below).
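To spell out why that matters, here is a minimal epsilon-greedy sketch; the function and variable names are illustrative, not the lab's actual `utils` code. With epsilon > 0, some of the logged timesteps come from random actions rather than from the argmax of the Q-network.

import numpy as np

rng = np.random.default_rng()

def choose_action(q_values, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:                  # explore: random action
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))             # exploit: best known action

# epsilon = 1.0 -> every logged action is random (Case 1 below)
# epsilon = 0.5 -> roughly half of them are (Case 2 below)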
How to read my plots:
- Grey areas mark episodes in which the robot crashed
- The thin red line is the raw per-episode reading
- The blue line is the 50-episode moving average
- The x-axis is episodes, from 0 to 1999
Case 1: Learning epsilon = 1.0, fully exploring
Six plots, arranged as:

| Column 1 | Column 2 |
| --- | --- |
| Total Displacement | Number of Timesteps |
| Total Points | Reward per timestep |
| Total Loss | Loss per update |
Observations:
- The number of timesteps and total displacement are small at first because the robot crashes early; they then increase as it learns to avoid early crashes, and finally decrease as its actions become more optimal
- Two significant changes occurred, at around episodes 500 and 1200. The first is a sharp improvement, whereas the second is a short-term degradation that it recovers from
- Reward per timestep keeps increasing, so the robot is getting more efficient.
Case 2: Learning epsilon = 0.5, half exploring
Six plots, arranged as:

| Column 1 | Column 2 |
| --- | --- |
| Total Displacement | Number of Timesteps |
| Total Points | Reward per timestep |
| Total Loss | Loss per update |
Observations:
- Total loss oscillates at first; that should be due to the changing number of timesteps per episode, which is why the same oscillation does not appear in the loss-per-update plot (a toy example follows this list)
- No short-term degradation is observed, unlike in Case 1.
- The crash rate improves more slowly than in Case 1.
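A toy illustration of that normalisation (the numbers are made up, not from my runs): two episodes with the same per-update loss but different lengths give very different total losses, while the loss per update stays flat.

import numpy as np

# hypothetical per-update losses for a short and a long episode
losses_short = np.full(25, 0.8)    # short episode: 25 network updates
losses_long = np.full(100, 0.8)    # long episode: 100 network updates

print(losses_short.sum(), losses_long.sum())    # total loss: 20.0 vs 80.0
print(losses_short.mean(), losses_long.mean())  # loss per update: 0.8 vs 0.8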
Cheers,
Raymond
Code changes:
def agent_learn(experiences, gamma):
    ...
    return loss  # Added: return the loss so it can be logged

# Section 9
ep_buffer = []  # Added: one record per episode
for i in range(num_episodes):
    ts_buffer = []  # Added: one record per timestep
    ...
        # (inside the per-timestep loop)
        if update:
            ...
            loss = agent_learn(experiences, GAMMA)  # Changed: keep the returned loss
        else:  # Added
            loss = 0.  # Added
        ts_buffer.append([np.sqrt(((state[:2] - next_state[:2])**2).sum()), reward, loss, float(update)])  # Added: displacement, reward, loss, update flag
    ...
    # if av_latest_points >= 200.0:  # Changed: don't stop early, keep training through all episodes
    #     print(f"\n\nEnvironment solved in {i+1} episodes!")  # Changed
    #     q_network.save('lunar_lander_model.h5')  # Changed
    #     break  # Changed
    ep_buffer.append((t, env.game_over or abs(state[0]) >= 1.0, total_points, *np.array(ts_buffer).sum(axis=0)))  # Added: episode length, crash flag, total points, and the summed per-timestep readings
# Plots
import numpy as np
from matplotlib import pyplot as plt

t, crashed, total_points, distance, reward, loss, update = np.array(ep_buffer).T
lines = {
    'number of timesteps': t,
    'total points': total_points,
    'total displacement': distance,
    'total loss': loss,
    'reward per timestep': reward / t,
    'loss per update': loss / update,  # nan for episodes that had no update
}

fig, axes = plt.subplots(3, 2, figsize=(16, 16), sharex=True)
for (name, line), ax in zip(lines.items(), axes.flatten()):
    # 50-episode moving average computed from a cumulative sum
    ma = np.cumsum(line)
    ma[50:] = ma[50:] - ma[:-50]
    ma = ma[50 - 1:] / 50
    # grey shading (on a secondary y-axis) for episodes flagged as crashed
    ax.twinx().fill_between(np.arange(len(line)), crashed, 0, color='grey', alpha=.2)
    ax.plot(line, lw=0.5, color='red')  # raw per-episode reading
    ax.plot(ma, color='blue')           # moving average
    ax.set_ylabel(name)
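(One small usage note: in a notebook the figure shows up on its own; if you run this as a plain script, add `plt.show()` at the end.)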