I am not 100% sure whether this is the right place to ask questions like this; if not, I apologize — please let me know and I will remove the post.
Anyhow, I have tried recreating the DQN algorithm, loosely based on the seminal paper from DeepMind, but even after several million iterations the algorithm doesn't seem to improve noticeably.
The implementation features all the core principles outlined in the paper, as well as in many other implementations (what one would consider "standard practice"), since the purpose of the implementation was educational. This includes:

- an experience replay buffer;
- two separate neural networks to avoid the "moving target" issue, with the target network updated every n steps;
- an epsilon-greedy exploration strategy (I carefully monitored that epsilon doesn't decay too quickly, but also that it eventually gets low enough to show some improvement over random play);
- reward shaping, which I added later to speed up training (I add extra reward for scoring/losing a point, not only for winning a game).
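For reference, the replay buffer and epsilon schedule in my setup are essentially equivalent to the following sketch (the class and function names here are my own illustrative choices, not the exact ones in my code; the linear decay to 0.1 over 1M steps mirrors the DeepMind paper's schedule):

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # deque with maxlen silently discards the oldest transitions once full
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive frames, which is the point of replay.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_at(step, eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    """Linear epsilon decay from eps_start to eps_end over decay_steps,
    then held constant at eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Printing `epsilon_at(step)` alongside the episode reward is how I checked that exploration wasn't decaying too fast relative to learning progress.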
I have tried visualizing the CNN activations for certain frames, and it seems that the convolutional layers have learned to distinguish certain visual features, such as the ball and the paddle, as shown in the image below:
Furthermore, looking deeper into the actual Q-value estimates coming out of my network, I do see that "better situations" tend to score higher (e.g. the ball about to be deflected by the paddle, or, highest of all, the ball about to hit a brick and score a point), as seen below:
Conversely, situations where the ball is about to be missed tend to score lower, indicating that the network did indeed learn something:
It is worth noting that, despite the overall Q-values looking "more reasonable", the values for individual actions within a state don't become noticeably more differentiated, nor does their variance increase significantly, over many iterations.
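To quantify that last observation rather than eyeball it, I have been tracking the gap between the best and worst action-value per state — a tiny hypothetical helper along these lines (not from my actual codebase):

```python
def q_spread(q_values):
    """Gap between the best and worst action-value for a single state.

    If this stays near zero as training progresses, the network is
    effectively indifferent to which action is taken, even if the
    overall state-value estimate looks sensible.
    """
    return max(q_values) - min(q_values)


def mean_q_spread(batch_of_q_values):
    """Average the spread over a batch of states, e.g. a fixed set of
    held-out frames evaluated periodically during training."""
    spreads = [q_spread(q) for q in batch_of_q_values]
    return sum(spreads) / len(spreads)
```

Plotting `mean_q_spread` on a fixed held-out set of frames over training is what led me to say the per-action values aren't separating.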
I am getting really desperate trying to debug this, as I am running out of ideas, and I would be extremely grateful for any advice on what may be causing the slow/non-existent convergence.
Am I just being impatient? Are 3-5 million iterations still relatively "early" for this specific task to converge? Since it is very time-consuming to run this locally (and Google Colab will time out), I have implemented an automatic store-load mechanism to continue the training every night.
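The store-load mechanism is roughly the following (a simplified sketch; the function names and the contents of the state dict are illustrative — in practice the state also includes the network weights and the replay buffer). The write is done atomically so a Colab timeout mid-save can't corrupt the checkpoint:

```python
import os
import pickle


def save_checkpoint(path, state):
    """Write the training state to disk atomically: write to a temp file
    first, then rename, so an interrupted save never leaves a
    half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems


def load_checkpoint(path):
    """Return the saved training state, or None if no checkpoint exists
    yet (i.e. this is a fresh run)."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)
```

On startup the trainer calls `load_checkpoint`; if it returns None, training starts from scratch, otherwise the step counter and epsilon schedule resume where they left off.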
More details on the concrete implementation, together with my code, can be found here