Adam optimization algorithm

Hi all,

I have developed a modification of the Adam optimizer.
In essence, it uses second-derivative information: along the update direction that Adam finds, it estimates where the local minimum lies and steps toward it. When the minimum is far away, this lets it move 3-5 times further in a single step than the learning rate alone would allow.
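To make the idea concrete, here is a minimal sketch of such a step-length estimate in Python/NumPy. It is a simplified illustration, not my full implementation: it approximates the directional second derivative with a finite difference and then uses a one-dimensional quadratic model to pick the step length. The names (curvature_step, grad_fn, the quadratic example) are illustrative, not part of the actual optimizer.

```python
import numpy as np

def curvature_step(theta, grad_fn, direction, lr, eps=1e-3, max_scale=5.0):
    """Estimate how far to move along `direction` from a finite-difference
    second derivative, capping the step at max_scale * lr.

    theta     : current parameter vector
    grad_fn   : function returning the loss gradient at a point
    direction : descent direction (e.g. the normalized Adam update)
    """
    g = grad_fn(theta)
    # Directional derivative of the loss along `direction` (negative for descent).
    slope = g @ direction
    # Finite-difference estimate of the curvature along the line:
    # d^T H d ~= (grad(theta + eps*d) - grad(theta)) . d / eps
    g_probe = grad_fn(theta + eps * direction)
    curvature = (g_probe - g) @ direction / eps

    if curvature > 0:
        # Quadratic model: the line minimum sits at alpha = -slope / curvature.
        alpha = np.clip(-slope / curvature, 0.0, max_scale * lr)
    else:
        # Non-convex along this line; fall back to the plain learning rate.
        alpha = lr
    return theta + alpha * direction

# Example on a simple quadratic loss 0.5 * ||A theta - b||^2:
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 2.0])
grad = lambda th: A.T @ (A @ th - b)
theta = np.zeros(2)
d = -grad(theta) / np.linalg.norm(grad(theta))  # normalized steepest descent
theta = curvature_step(theta, grad, d, lr=0.1)
```

On a quadratic loss this lands on (or, with the cap, toward) the line minimum in one step, which is where the "3-5 times further" behaviour comes from; the cost is one extra gradient evaluation per step.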
I have tested it with different network topologies. With just 2 hidden layers it significantly outperforms Adam (I have validated this many times), but with 5 or more hidden layers I no longer observe any advantage over Adam.
Now I would like to understand the mechanism behind this. In particular, if some overshoot is necessary for Adam to find the right direction, why is there a difference between networks that are 2 and 5 layers deep?
Does somebody have a hint?

Regards,
Andreas

You may have to look outside of the DLAI domain for help with this. DLAI is pretty much dedicated to ML applications, not the mathematical details of how optimizers work.