Performance drop during DeepQ training

Hey everyone 🙂

I have been working on several RL projects, trying to learn through implementing and playing around, and currently I am learning DeepQ and actor-critic methods.

I find it interesting that during training on a cart-pole example, performance first peaks near the maximum (episodes last close to the max of 500 steps) and then drops significantly afterwards, as seen in the image below:

I have heard someone say in a YouTube video that this occurs in actor-critic methods (I guess the same goes for DeepQ), but with no explanation of why.
I would like to gain an intuition about what is going on here. Could it be that my replay buffer is too small, so one bad episode “pushes” the weights and biases further from the optimum?

Besides the obvious solution of using early stopping and saving the best-performing weights, what are the solutions to this issue? (Which other, more advanced methods address it?)
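
For reference, this is roughly what I mean by saving the best-performing weights, as a simplified sketch (num_episodes, run_training_episode and q_network are placeholders for my actual code):

# Rough sketch of the "save the best weights" workaround I mentioned
# (simplified; num_episodes, run_training_episode and q_network stand in for my actual code).
best_avg_reward = -float("inf")
best_weights = None
recent_rewards = []

for episode in range(num_episodes):
    episode_reward = run_training_episode(q_network)  # placeholder for one training episode
    recent_rewards = (recent_rewards + [episode_reward])[-20:]  # moving window of the last 20 episodes

    avg_reward = sum(recent_rewards) / len(recent_rewards)
    if avg_reward > best_avg_reward:
        best_avg_reward = avg_reward
        best_weights = q_network.get_weights()  # snapshot the best network seen so far

# After training (or an early stop), restore the best snapshot instead of the final weights.
q_network.set_weights(best_weights)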

Related Notebook:

Hi @Amir_Pasagic

Can I know how the reward is scaled in your data? I mean, is it based on time or on the action space size?

Also, I noticed that both of your networks use a linear activation function, whereas for the optimizer you are using Adam.

Did you try using stochastic gradient descent with momentum, in case your model's data is linearly spread?

Thank you for your answer!

It is not scaled; it is typically +1 for each step, encouraging longer episodes as the algorithm gets better at balancing the pole. Since the reward was either +1 or 0 (when done), I didn't see any need for scaling it.

In a later iteration I did try modifying the reward to give it more incentive to keep the pole near 0° and the cart near the origin, by further shaping the reward as follows:

# Calculate normalized "closeness" to center and upright
angle_bonus = 1.0 - (abs(pole_angle) / (0.2))   # 0.2 rad ≈ 11.5 degrees, a typical limit
position_bonus = 1.0 - (abs(cart_pos) / (2.4))  # Cart position limit is usually ±2.4

# Clip to avoid negative bonuses
angle_bonus = max(angle_bonus, 0)
position_bonus = max(position_bonus, 0)

# Scale bonus values
reward_shaping = 0.25 * angle_bonus + 0.25 * position_bonus

# Augment the reward
reward += reward_shaping
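
For context, pole_angle and cart_pos above are simply unpacked from the CartPole observation (assuming the standard Gym/Gymnasium observation layout):

# CartPole observation layout: [cart position, cart velocity, pole angle, pole angular velocity]
cart_pos, cart_vel, pole_angle, pole_vel = state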

“Also, I noticed that both of your networks use a linear activation function, whereas for the optimizer you are using Adam.”

Is it not common to use Adam with linear activation functions?
Why does that matter, especially since the hidden layers use ReLU, in case the non-linearity makes a difference?

To be honest, I did not play much with the optimizer/solver; I just heard/read somewhere that this is a general pitfall of such methods, but I am not sure why that is so.

I notice you are not including a negative bonus, so are there no negative data points in the model training?

Also, honestly, it was not the Adam optimizer I was questioning when I asked; I actually wanted to ask why you chose a linear activation function. The reason I am asking is that if your data is spread or distributed linearly, then it's okay, but seeing the output, I noticed your data is not distributed linearly. You could use a sigmoid activation function, since your reward scale is based on +1 or 0 depending on the longer episode.

Sharing a GitHub repo related to CartPole reinforcement learning:

Also, regarding the question of why SGD is a better choice over Adam when you have a linear data distribution: it generalizes better because it updates the weight parameters after each data point, and it can handle larger datasets.
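
Something like this is what I had in mind, just as a rough sketch (the layer sizes are made up; it only shows where the output activation and the optimizer choice would go in a tf.keras Q-network):

import tensorflow as tf

# Sketch only: the layer sizes are made up, this just shows where the output
# activation and the optimizer choice go in a tf.keras Q-network.
q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(4,)),  # 4 = CartPole state size
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="sigmoid"),  # instead of "linear"; 2 = action space size
])

# SGD with momentum instead of Adam
q_network.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="mse",
)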

No, but this is typically the nature of the cart-pole problem; it is usually handled like that. My addition here is only an augmentation w.r.t. the pole angle and the cart position.
I am not sure whether this “biased” data should cause any issues here, but I have never seen it being addressed in this rather ubiquitous example.

I have found this seemingly very related post:

https://ai.stackexchange.com/questions/28079/deep-q-learning-catastrophic-drop-reasons

Thanks for sharing this; I actually wanted to point out this overfitting problem.

I notice that in your next-state steps you use the state space size. Why didn't you approach this with a window of (window size - 1) previous steps, where each previous step helps detect the next one? It would actually help your network understand and compare the previous pattern with the current pattern, helping with long-term trends and short-term fluctuations, and give you insight into the flat minima of the data points at the end of the episodic iterative training.
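
Roughly what I mean, as a hypothetical sketch (window_size, state_size and stacked_input are made-up names): instead of feeding only the current state, stack the last few states so the network can compare the previous pattern with the current one.

import numpy as np
from collections import deque

# Sketch: keep a rolling window of the last `window_size` states and feed the
# stacked window to the network instead of only the current state.
window_size = 4                       # hypothetical value
state_size = 4                        # CartPole observation size
history = deque(maxlen=window_size)

def stacked_input(state, history):
    history.append(np.asarray(state))
    while len(history) < window_size:   # pad with the oldest state until the window is full
        history.appendleft(history[0])
    return np.concatenate(history)      # shape: (window_size * state_size,)

# The Q-network input shape then becomes (window_size * state_size,) instead of (state_size,).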

I am not sure what you are referring to; could you please point out which part of the code? Is it about updating the Q-network every 5 steps?

Typically the target network is updated every n steps to avoid moving-target issues (in my case every 25 episodes, I think). Besides that, I also do not update the Q-network on each step but on every 5th step, as the execution is quite slow otherwise.

This is typically not done, but my execution would get painstakingly slow, so I reduced the number of fits.

I am not sure if this is what you were referring to, but since the data is randomly sampled from the buffer, that should reduce the correlation between the patterns. (Perhaps I misunderstood what you were getting at.)
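
For clarity, the schedule I am describing looks roughly like this (simplified from my notebook, assuming the newer Gymnasium step API; env, num_episodes, choose_action, train_q_network, replay_buffer and the two networks are placeholders for my actual code):

import random

UPDATE_EVERY = 5        # fit the Q-network only on every 5th step (for speed)
TARGET_SYNC_EVERY = 25  # copy the Q-network weights into the target network every 25 episodes
BATCH_SIZE = 64

for episode in range(num_episodes):
    state, _ = env.reset()
    done = False
    step = 0
    while not done:
        action = choose_action(state)  # epsilon-greedy, placeholder
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        replay_buffer.append((state, action, reward, next_state, done))

        if step % UPDATE_EVERY == 0 and len(replay_buffer) >= BATCH_SIZE:
            minibatch = random.sample(replay_buffer, BATCH_SIZE)  # random sampling breaks step-to-step correlation
            train_q_network(minibatch)  # placeholder for the actual fit call
        state = next_state
        step += 1

    if episode % TARGET_SYNC_EVERY == 0:
        target_network.set_weights(q_network.get_weights())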

In your input shape, what is the state space size?

Also, use an alpha (learning rate) of 0.1 and a gamma (discount factor) of 0.9 for better rewards, where one is focused on long-term rewards and the other on immediate rewards.
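
As a sketch of where those two constants enter the update (q_network, next_state, reward and done are placeholder names; the formula is the standard Q-learning target):

import numpy as np

alpha = 0.1  # learning rate (optimizer step size)
gamma = 0.9  # discount factor: balances immediate vs. long-term reward

# Tabular form of the update, just to show the roles of alpha and gamma:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))

# In the DQN setting, gamma enters through the Bellman target for a sampled transition...
q_next = q_network.predict(next_state[np.newaxis], verbose=0)[0]
target = reward if done else reward + gamma * np.max(q_next)

# ...while alpha is what you pass to the optimizer,
# e.g. tf.keras.optimizers.Adam(learning_rate=alpha).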