Help needed: Breakout using DeepQ learning fails to converge in a reasonable timeframe

I am not 100% sure if this is the right place to ask questions like this; I apologize if not. Please let me know and I will remove the post.

Anyhow, I have tried recreating the DQN algorithm, loosely based on the seminal paper from DeepMind, but even after several million iterations the algorithm doesn’t seem to improve noticeably.

The implementation features all the core principles outlined in the paper, as well as in many other implementations (what one would consider “standard practices”), since the purpose of the implementation was educational. This includes: an experience replay buffer; two disparate neural networks to avoid the “moving target” issue (with the target network being updated every n steps); an epsilon-greedy exploration strategy (where I carefully monitored that epsilon doesn’t decay too quickly, but also that it was at points sufficiently low to see some improvement over randomness); and even reward augmentation, which I added later to speed up the process (I add extra reward for scoring/losing a point, not only for winning a game).
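For readers less familiar with these pieces, the epsilon-greedy part can be sketched roughly as below. This is a minimal illustration, not my actual code; all names and the decay schedule parameters are just typical values:

```python
import numpy as np

def linear_epsilon(step, eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold at eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """With probability epsilon take a uniform random action, else the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

The target network then simply gets a copy of the online network’s weights every n steps, so the bootstrap targets stay fixed in between.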

I have tried to visualize the CNN activations for certain frames, and it seems that the CNN layers learned to distinguish certain visual features, such as the ball and the paddle, as shown in the image below:

Furthermore, upon looking deeper into the actual Q-value estimates coming out of my networks, I do see that “better situations” tend to score higher (e.g. the ball about to be deflected by the paddle or, highest of all, the ball about to hit a brick and score a point), as seen below:

Conversely, situations where the ball is about to be missed tend to score lower, indicating that the network did indeed learn something:

It is worth noting that, despite the Q-values looking “more reasonable”, the values for individual actions don’t become more sensible, or increase significantly in variance, over the many iterations.

I am getting really desperate trying to debug this, as I am running out of ideas, and I would be extremely grateful for any advice on what may be causing the slow/absent convergence.

Am I just being impatient? Are 3-5 million iterations still relatively “early” for this specific setup to converge? Since it is very time-consuming to run this locally (and Google Colab will time out), I have implemented an automatic store-load to continue the training every night.

You can find more on the concrete implementation, together with my code, here

Edit: Furthermore, analysis of the action distribution chosen by argmax(Q(s,a)) when epsilon is sufficiently small shows a strong preference for the LEFT or NOOP actions (stand still), leading to the paddle mostly being stuck in the lower left corner.

It is unclear to me where this imbalance comes from. However, if we plot the average Q-value per action, as well as the distribution of the margin between the Q-value of the most preferred action and the second-best action, we can see that most of the differences between actions fall within a very small range (most commonly between 0 and 0.005), showing that there is still not much distinction between the action values, even though the Q-values quite correctly indicate the value of the individual states (being close to losing a life leads to negative values, and the value increases as the ball gets closer to hitting a brick).
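For anyone wanting to reproduce this margin diagnostic: given a batch of Q-value outputs, the top-vs-runner-up margin per state can be computed like this (a sketch; `q_batch` is a hypothetical (N, n_actions) array, not my actual data):

```python
import numpy as np

def action_margins(q_batch):
    """Margin between the best and second-best Q value for each state."""
    q_sorted = np.sort(q_batch, axis=1)        # ascending along the action axis
    return q_sorted[:, -1] - q_sorted[:, -2]   # top minus runner-up

# Two made-up states whose action values are nearly indistinguishable.
q_batch = np.array([[0.030, 0.031, 0.029, 0.030],
                    [0.010, 0.012, 0.011, 0.009]])
margins = action_margins(q_batch)
```

A histogram of these margins over the replay buffer is what shows whether the network actually differentiates between actions, or only between states.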

It is noticeable, when replaying the episodes, that the value function is rather clear for the extreme states, i.e. close to missing the ball or close to hitting a brick, but oftentimes the ball in between defaults to a Q-value of 0.03 for a lot of states. This is reflected in the histogram of Q-values for the contents of the buffer.

All in all, it seems that the agent has not yet undergone sufficient training to learn the value of all states, or to distinguish between the optimal values for individual actions, even though it has been over 5 million iterations, and it is a bit strange that after weeks of training on a slow machine the reward curve remains unchanged. This is possibly due to a sub-optimal choice of hyper-parameters, even though I do believe these were quite standard (alpha=0.0005, in case anyone is wondering).

Thanks for posting this thread, I find it very interesting.

Unfortunately I don’t have any direct guidance to offer. Personally, DQNs baffle me, and I have no idea how to debug one.

You might consider taking a look at this material from Keras on a similar implementation:


hi @Amir_Pasagic

Don’t you think the Q-value when the ball is hitting the paddle should be the one that gets rewarded, and the lowest score reserved elsewhere, for better model convergence? It looks like the model is doing the reverse, as it scored the least when the ball is about to hit the ground, which would be the end of the game. So there is surely an issue in the implementation based on the concept you explained (because your Q-values are showing the opposite result).

Can you share the Google Colab link, so we can look at your code?

Also, kindly describe briefly your idea of the implementation and the reasoning behind it.

Also, you mentioned reward points: is there a reward when the ball hits the paddle, irrespective of whether it is lower or higher relative to the ground?

Also, did you include a penalty for when the ball hits the ground, as that would be the end of the game?

regards
DP


Hey TMosh, thank you for your reply. I am glad you find it interesting; I do as well, and I think it is an excellent exercise to debug something like this using nothing but deduction and general intuition for how DQN works.

I agree it is quite hard to wrap your head around some of these concepts, and I find it pretty unbelievable that something with such scarcity (and delay) of reward can converge to any meaningful result. Considering this, it may well be the case that, with slightly sub-optimal parameters or network initialization, it can take well above 10 million iterations to see any change in the average reward.

To be fair, for the first few million iterations the CNN activations just spew out random-looking heatmaps.

PS. I did check the Keras implementation linked and have somewhat used it for sanity-checking my code and tuning, though to be fair I have not done this in a super-pedantic fashion.


Hi Deepti, thank you for your answer.

Firstly, the environment in question does not output a signal for “ball hit the paddle”. If it did, it could/should definitely be used for reward augmentation. What is output, however, are the lives and the results in terms of points, which increase for hitting a brick.

At the moment, the reward is augmented to subtract half a point for losing a life, on top of the default Breakout environment reward, which only rewards hitting a brick. This leads to negative total rewards, and to the reward generally being low/negative when the paddle is about to miss the ball (or the ball is about to hit the floor, as you say), as seen in the third figure above, titled Bottom Q-Value Frames.
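The life-loss penalty can be implemented as a small wrapper along these lines. This is a sketch following the Gymnasium-style step API and the `info["lives"]` field that ALE environments expose, not my exact code; the class name and penalty constant are illustrative:

```python
class LifeLossPenalty:
    """Wrap a Gymnasium-style env and subtract a penalty whenever a life is lost."""

    def __init__(self, env, penalty=0.5):
        self.env = env
        self.penalty = penalty
        self._lives = 0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._lives = info.get("lives", 0)   # remember the starting life count
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        lives = info.get("lives", self._lives)
        if lives < self._lives:              # a life was lost on this step
            reward -= self.penalty
        self._lives = lives
        return obs, reward, terminated, truncated, info
```

The agent then only ever sees the augmented reward, while the underlying environment stays untouched.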

Conversely, higher Q-values are assigned to the frames directly before hitting the bricks, as seen in Figure 2, Top Q-Value Frames.

That is why the ball being lower and traveling down is associated with lower scores, as it makes it more likely to lose a life, leading to the -0.5 reward, whereas the ball traveling upwards is associated with higher rewards: the higher the ball (the closer to hitting a brick), the higher the Q-value. The reward in both these cases is just as expected. (EDIT: in the GIFs it can be seen that if the ball is low and traveling down, the Q-value is somewhat higher when the ball is likely to hit the paddle than when it is about to miss it; I will try to upload these later. Since hitting the paddle cannot be rewarded per se, there being no signal indicating this event, there is no significantly higher Q-value associated with it, only implicitly for not losing a life and hitting a brick some n frames later.)

My apologies for neglecting to mention that what is shown in these figures (2 and 3) is actually a frame colored with the diff of 2 consecutive frames, indicating the direction in which the paddle or the ball is moving. The direction should be read “from blue towards red”.

Finally, as for the code, I do have a Colab notebook, but it does not reflect the most recent state of the code. Instead, please check out the GitHub repository linked in the description, as it also contains the intended implementation as well as the hyperparameter tuning:

Hello @Amir_Pasagic,

This is interesting, but it was almost a month ago, so are you still working on it? From the readme, you were implementing the paper there with some changes. I am wondering if you had any success without those changes?

Section 5 of the paper says:

I believe you don’t have it because you have a new action for each step:

I am asking because I wonder if it helps with more than just time-saving.

But first things first: any success without those changes?

Cheers,
Raymond


Also, I find that you imported ale_py in environment.py but never use it. The paper seems to be using, if I am not mistaken, the Arcade Learning Environment for seven of its games.

Was not using ale_py intentional?


First of all, thank you very much for taking the time to look into this, and even read into the code. I truly appreciate it.

Unfortunately, I got stuck on this for much longer than just a month. I had to shift my attention to other projects I started, and on occasion I would let it run overnight in the hope that (perhaps due to the use of sub-optimal hyper-parameters) it just takes an order of magnitude longer to start converging. I may point out that training is extremely slow locally on my machine, and I am not sure what good alternatives there are that offer continuous runtime, unlike Colab.
I decided to revisit the topic a month ago, made some plots to aid the diagnostics, and decided to share them here, on Kaggle, and on Stack Exchange in the hope someone may have experienced similar issues.

I did not, strictly speaking, try to recreate the paper’s implementation.
I was trying to learn by watching videos, reading the materials, and checking out other implementations, and once I felt I understood the core concepts, to just try to do it on my own, using more or less the same or very similar parameters. (For example, I checked this implementation by Keras as a loose reference: Deep Q-Learning for Atari Breakout.)
I changed parameters due to my very limited resources, which made execution painstakingly slow, and I believed that I could make the convergence faster at the cost of performance.

In hindsight I see that my rookie mistakes were the beliefs that a) models cannot be that sensitive to small changes in parameters and will only perform suboptimally, but will eventually converge, and b) I have the right intuition for which small changes in parameters may lead to faster convergence (albeit perhaps with not as strong a performance). You are right though: my ReadMe may be misleading and I will fix that.

Regarding skipping the frames, I believed that this refers to using every 4th frame, which is done automatically by the environment wrapper. I had overlooked the use of the action in such cases; I will certainly check what is the case there.

EDIT: Now I see that I am using “BreakoutNoFrameskip-v4”, whereas I was convinced all this time that I was using the environment version that skips every 4 frames. Quite curious, I copy-pasted it from the library’s list of environments, but apparently selected the wrong one :man_facepalming: thank you for this catch! It is an interesting question whether the only effect of this will be on the speed of convergence, or whether it may affect the stability as well. Either way, my 5 million steps so far may have just become the equivalent of roughly 1 million steps, so it may be a bit less curious that the reward still did not increase noticeably :slight_smile:
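In case it helps anyone else reading: frame skipping in the DQN sense means repeating the chosen action for k consecutive emulator frames and accumulating the reward, roughly like the sketch below. This is an illustration of the idea, not the exact ALE implementation (which, among other things, also max-pools over the last frames to handle flicker):

```python
class FrameSkip:
    """Repeat each action for `skip` frames and sum the intermediate rewards."""

    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        total_reward = 0.0
        terminated = truncated = False
        obs, info = None, {}
        for _ in range(self.skip):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if terminated or truncated:   # don't step past the end of an episode
                break
        return obs, total_reward, terminated, truncated, info
```

With NoFrameskip, the agent instead decides on every single emulator frame, so each “step” covers a quarter of the game time and rewards become correspondingly sparser per decision.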

I was previously experimenting with different ways one can run these environments, and I guess this was just a leftover from those previous attempts. Thanks for pointing that out.

I honestly struggled to understand the differences between gym, gymnasium, and ALE before (and to some extent I am still not 100% clear), and when I wanted to make a wrapper to employ reward augmentation (to punish losing lives), I had a lot of issues with library compatibilities; what is in there now is the version that finally worked.

My understanding is that Gymnasium, which is a community-maintained successor of OpenAI’s Gym, basically runs ale-py under the hood, but provides a layer of abstraction on top, so one doesn’t have to interact with the emulator manually for all the basic things, e.g. getting frames etc.

Hello Mak @Amir_Pasagic,

I see. I was asking whether there was any trial of the paper’s implementation (without changes), because it is my training as an experimentalist that it is better to start off from a known baseline and then study one change at a time. The baseline would be the paper in this case.

Even though ale_py is a dependency of gym, I will try to implement it with ale_py directly, as that looks closer to the paper’s setup. I only wonder how far I can train the model given my limited resources. I say if we can get a good baseline, then we can try the changes, understand them, and ultimately understand why your current model might not look promising.

I would suggest you do the same at some point, just in case I shift focus… :laughing:

This needs to be verified, and we could check whether two consecutive steps share some frames in their stacks, but this will not be my concern for now, because I am going to use ale_py.

That’s one very interesting question I have been curious about :wink: :wink: :wink: It will take some experimenting to verify, but if that’s the case, then the facts that (1) the paper stacks 4 frames into one input and (2) it takes a new action every 4 frames (for all but one of the games) may not be a mere coincidence.

Cheers,
Raymond

I see. I was asking whether there was any trial of the paper’s implementation (without changes), because it is my training as an experimentalist that it is better to start off from a known baseline and then study one change at a time. The baseline would be the paper in this case.

Unfortunately not, but this is certainly a valuable lesson here.

I have reverted to the following environment definition:

    env_id = "BreakoutNoFrameskip-v4"
    env = gym.make(env_id, render_mode=render_mode, frameskip=4)

as defined in the Breakout - ALE documentation, and I quite quickly noticed an improvement in reward. Now, I find it suspicious that this happened so quickly, so I first need to check whether this change affects how the reward is calculated, or whether it is really about agent performance.

When I stop at a breakpoint and execute a short helper script to animate 100 steps or so, I see that the behavior is far less jittery, and the motion seems much more coherent and less like Brownian motion. There is still a strong tendency towards the left corner, but, to be fair, it has been trained for many, many episodes on (perhaps) imperfect data.

Just to give some quantification here: usually one run would result in rewards between -2 and 0 (-2 points subtracted for losing all lives), sometimes going up to 1 or 2. Over many runs I have even seen 4-6, but since earlier today I commonly see episodes with rewards of 9-12, so that is a significant improvement within 200 episodes, perhaps just by virtue of the agent being less erratic.

What I also noticed is that the environment, once initialized, always starts off with the same conditions, which in this last run means always shooting towards the left corner. I am not sure if this was also the case in the previous runs, but if so, I can see how training for a million steps on episodes that always start that way would lead to a “left-leaning” preference. Perhaps this is something where ale_py would provide more flexibility in initializing the environment.
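If deterministic starts do turn out to be the problem, one standard remedy (used in the DQN paper and in common Atari wrapper stacks) is to take a random number of no-op actions after each reset, so that episodes never start from the exact same state. A rough sketch, assuming action 0 is NOOP (the class name and parameters here are illustrative):

```python
import random

class RandomNoopReset:
    """Take between 1 and `max_noops` NOOP actions after each reset."""

    def __init__(self, env, max_noops=30, noop_action=0):
        self.env = env
        self.max_noops = max_noops
        self.noop_action = noop_action

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        for _ in range(random.randint(1, self.max_noops)):
            obs, _, terminated, truncated, info = self.env.step(self.noop_action)
            if terminated or truncated:      # extremely unlikely, but be safe
                obs, info = self.env.reset(**kwargs)
        return obs, info
```

This way the agent sees a distribution of initial ball trajectories instead of a single one.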

Also, thank you again for your attention to this issue and your commitment to reviewing and even implementing the code yourself. I know how scarce and precious time can be when one is working on multiple things, so it’s very admirable that you take the time to dig into this :slight_smile:


Now I have started too, so we are not very different (except I am far behind you), and I will say the same thing to you :wink:

Cheers,
Raymond

Hi Raymond, I wanted to thank you once again for helping me solve this very frustrating issue :slight_smile: I am trying to learn systematically, implementing everything from scratch before I move on to more complex methods (next, I think, Actor-Critic, and slowly moving towards more advanced stuff like PPO). I want to have a solid grasp of the basics before I move on to working with libraries and Isaac Lab.

I tried many forums to find help in solving this issue, including Kaggle and Stack Exchange, but no one really had any idea, and this was an excellent catch. I am surprised the frame skip had such a profound effect on convergence. Looking at the GIFs I created, it does make sense: the agent rarely makes any smart sequence of moves that would lead to it not losing a life, so the reward is even more scarce and even more delayed (more actions occur in between). Perhaps it would converge eventually, but I would probably be a year older.

Now it is improving very slowly, but I see the individual rewards increase in the leading digit every few hours. I will post the results and the solution, and also update my posts elsewhere, so whoever happens to run into the same problem has a hope of finding a solution :slight_smile: definitely a lesson learned here.


You are welcome, Mak @Amir_Pasagic!

I am also having fun here. My agent’s running average reward has been increasing, but pretty slowly. I think I can give it some speed boost when I have time these two days by reviewing my code. It’s just the paper, training on my laptop (CPU :face_with_crossed_out_eyes:), without a target Q-network, and it’s a small network with 2 conv layers and 2 dense layers (as used by the paper). I admit that I wanted very much to add a TQN, but the paper is the paper (though I did make some small changes).

How’s your agent? Mine is catching up :wink:

Cheers,
Raymond

Btw, Mak, if you would like to, you might update your repo so I can see if I can spot something else. You know, I have studied the problem, so I might see it differently. If you think it’s a good idea, let me know, but no rush on that, because I am going to make my agent train faster these two days first.

I restarted yesterday night, and the improvement of the running average reward looks to be accelerating now:

My CPU is giving me 100,000 timesteps per hour, so it’s going to take about 4 days to run the 10M steps stated in the paper, but I will see how it goes and decide…

Cheers,
Raymond

Hey Raymond. First of all, my apologies for the long delay in responding. We had extensive flight test campaigns at work, followed by long holidays back home on the Mediterranean Sea, after which I partnered up with some people on Kaggle to participate in a competition on contrastive learning for EEG data (for educational purposes, of course; I have no hopes or aspirations of actually winning) :slight_smile:

Regarding the RL, the agent’s behavior has definitely improved, but I feel that after days of overnight training it has reached a certain plateau. The max reward I observe is usually no more than 25 (with the -2 life-loss penalty), but perhaps I am just not patient enough. I am not sure what time frame I should expect to wait.

I would appreciate it a lot if you have the time and will to look into it; I will try to push the latest code today or tomorrow and let you know. Thanks again! A.


Sure, let me know! I have been swapping between different tasks these days, so it might take some time for me as well to read your code and respond, but it’s great to know that you are already pretty occupied too, so I think I won’t need to feel guilty for being the same and taking more time :wink: :wink: Hah-hah.

Is the competition over? Would you mind sharing the link to it? Just curious.

Cheers,
Raymond

I pushed the changes, though there should only be some minor differences regarding the FrameSkip. Also, no rush; look into it whenever you have time.

The EEG data competition is going on until the end of this month, but I plan to continue working on it after it is done, since I really just want to gain some experience and build a portfolio :slight_smile:

There is plenty of information scattered around, as well as a Discord server; if you are interested in joining or discussing ideas, feel free to write to me directly and I can get you up to date :slight_smile:

Best,
Amir