Help needed: Breakout using deep Q-learning fails to converge in a reasonable timeframe

Hi Deepti, thank you for your answer.

Firstly, the environment in question does not output a signal for “ball hit the paddle”. If it did, it could (and should) definitely be used for reward augmentation. What is output, however, are the remaining lives and the score in points, which increases when a brick is hit.

At the moment, the reward is augmented to subtract half a point for losing a life, on top of the default Breakout environment reward, which only rewards hitting a brick. This leads to negative total rewards and to the reward generally being low/negative when the paddle is about to miss the ball (or the ball is about to hit the floor, as you say), as can be seen in the third figure above, titled “Bottom Q-Value Frames”.
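For concreteness, the life-loss penalty can be expressed as a small environment wrapper. This is only a minimal sketch, assuming a Gymnasium-style ALE environment that reports `lives` in the `info` dict; the names (`LifeLossPenalty`, `"ALE/Breakout-v5"`) are hypothetical and the actual implementation lives in the repo linked below:

```python
import gymnasium as gym

class LifeLossPenalty(gym.Wrapper):
    """Subtract a fixed penalty from the reward whenever a life is lost."""

    def __init__(self, env, penalty=0.5):
        super().__init__(env)
        self.penalty = penalty
        self._lives = 0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        # ALE environments expose the remaining lives in the info dict
        self._lives = info.get("lives", 0)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        lives = info.get("lives", self._lives)
        if lives < self._lives:
            reward -= self.penalty  # -0.5 on top of the default brick reward
        self._lives = lives
        return obs, reward, terminated, truncated, info

# env = LifeLossPenalty(gym.make("ALE/Breakout-v5"), penalty=0.5)
```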

Conversely, higher Q-values are assigned to the frames directly before hitting a brick, as seen in Figure 2, “Top Q-Value Frames”.

That is why the ball being lower and traveling down is associated with lower scores, as it makes losing a life (and the resulting -0.5 reward) more likely, whereas the ball traveling upwards is associated with higher rewards, and the higher the ball (the closer it is to hitting a brick), the higher the Q-value. The reward in both these cases is just as expected. (EDIT: in the GIFs it can be seen that if the ball is low and traveling down, the Q-value is somewhat higher when the ball is likely to hit the paddle than when it is about to miss it; I will try to upload these later. Since hitting the paddle cannot be rewarded per se, as there is no signal indicating this event, there is no significantly higher Q-value associated with it, only implicitly through not losing a life and hitting a brick some n frames later.)

My apologies for failing to mention that what is shown in these figures (2 and 3) is actually a frame colored with the diff of two consecutive frames, indicating the direction in which the paddle or the ball is moving. The direction should be read “from blue towards red”.
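In case it helps reproduce the figures, something along these lines produces that kind of overlay. This is just a sketch of the idea, not my exact plotting code; it assumes grayscale frames as NumPy arrays and uses a diverging colormap so the old position shows up blue and the new one red:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_frame_diff(prev_frame, curr_frame):
    """Overlay the signed difference of two consecutive frames on the
    current frame: blue = where the object was, red = where it is now."""
    diff = curr_frame.astype(np.float32) - prev_frame.astype(np.float32)
    plt.imshow(curr_frame, cmap="gray")
    plt.imshow(diff, cmap="bwr", alpha=0.5, vmin=-255, vmax=255)
    plt.axis("off")
    plt.show()
```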

Finally, as for the code, I do have a Colab notebook; however, it does not reflect the most recent state of the code. Instead, please check out the GitHub repo linked in the description, as it also contains the intended implementation as well as the hyperparameter tuning: