Help needed: Breakout using deep Q-learning fails to converge in a reasonable timeframe

Hi Deepti, thank you for your answer.

Firstly, the environment in question does not output a signal for “ball hit the paddle”. If it did, it could (and should) definitely be used for reward augmentation. What is output, however, are the remaining lives and the score in points, which increases when a brick is hit.

At the moment, the reward is augmented to subtract half a point for losing a life, on top of the default Breakout environment reward, which only rewards hitting a brick. This leads to negative total rewards and to the reward generally being low/negative when the paddle is about to miss the ball (or the ball is about to hit the floor, as you say), as can be seen in the third figure above, titled “Bottom Q-Value Frames”.
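For concreteness, the life-loss penalty can be expressed as a small environment wrapper. This is only a minimal sketch, assuming a Gymnasium-style ALE environment that reports `lives` in the `info` dict; the names (`LifeLossPenalty`, `"ALE/Breakout-v5"`) are hypothetical and the actual implementation lives in the repo linked below:

```python
import gymnasium as gym

class LifeLossPenalty(gym.Wrapper):
    """Subtract a fixed penalty from the reward whenever a life is lost."""

    def __init__(self, env, penalty=0.5):
        super().__init__(env)
        self.penalty = penalty
        self._lives = 0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        # ALE environments expose the remaining lives in the info dict
        self._lives = info.get("lives", 0)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        lives = info.get("lives", self._lives)
        if lives < self._lives:
            reward -= self.penalty  # -0.5 on top of the default brick reward
        self._lives = lives
        return obs, reward, terminated, truncated, info

# env = LifeLossPenalty(gym.make("ALE/Breakout-v5"), penalty=0.5)
```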

Conversely, higher Q-values are assigned to the frames directly before hitting a brick, as seen in Figure 2, “Top Q-Value Frames”.

That is why the ball being lower and traveling down is associated with lower scores, as it makes losing a life (and the resulting -0.5 reward) more likely, whereas the ball traveling upwards is associated with higher rewards, and the higher the ball (the closer it is to hitting a brick), the higher the Q-value. The reward in both these cases is just as expected. (EDIT: in the GIFs it can be seen that if the ball is low and traveling down, the Q-value is somewhat higher when the ball is likely to hit the paddle than when it is about to miss it; I will try to upload these later. Since hitting the paddle cannot be rewarded per se, as there is no signal indicating this event, there is no significantly higher Q-value associated with it, only implicitly through not losing a life and hitting a brick some n frames later.)

My apologies for failing to mention that what is shown in these figures (2 and 3) is actually a frame colored with the diff of two consecutive frames, indicating the direction in which the paddle or the ball is moving. The direction should be read “from blue towards red”.
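In case it helps reproduce the figures, something along these lines produces that kind of overlay. This is just a sketch of the idea, not my exact plotting code; it assumes grayscale frames as NumPy arrays and uses a diverging colormap so the old position shows up blue and the new one red:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_frame_diff(prev_frame, curr_frame):
    """Overlay the signed difference of two consecutive frames on the
    current frame: blue = where the object was, red = where it is now."""
    diff = curr_frame.astype(np.float32) - prev_frame.astype(np.float32)
    plt.imshow(curr_frame, cmap="gray")
    plt.imshow(diff, cmap="bwr", alpha=0.5, vmin=-255, vmax=255)
    plt.axis("off")
    plt.show()
```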

Finally, as for the code, I do have a Colab notebook; however, it does not reflect the most recent state of the code. Instead, please check out the GitHub repo linked in the description, as it also contains the intended implementation as well as the hyperparameter tuning: