Hi all,
I've created an improvement to the Adam optimizer algorithm.
In essence, it uses second-derivative information: it estimates where the local minimum lies along the direction Adam found and moves toward it. With this approach, a single step can travel 3-5 times further than the learning rate when the minimum is far away.
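To make the idea concrete, here is a minimal sketch of the core step. This is not my actual implementation; `curvature_scaled_step`, `grad_fn`, and all constants are placeholder names, and the finite-difference estimate is just one way to approximate the directional second derivative:

```python
import numpy as np

def curvature_scaled_step(theta, grad_fn, adam_dir, lr, eps=1e-4, max_scale=5.0):
    """One update step that estimates the 1D minimum along the given
    direction from a finite-difference second derivative.
    (Sketch only; names and constants are placeholders.)"""
    d = adam_dir / (np.linalg.norm(adam_dir) + 1e-12)  # unit search direction
    g0 = grad_fn(theta)
    slope = g0 @ d                                     # f'(0) along d
    g1 = grad_fn(theta + eps * d)
    curv = (g1 - g0) @ d / eps                         # f''(0) along d (finite difference)
    if slope < 0.0 and curv > 0.0:
        # Newton estimate of the 1D minimum, capped at max_scale * lr
        t = min(-slope / curv, max_scale * lr)
    else:
        t = lr  # locally non-convex or non-descent: plain learning-rate step
    return theta + t * d

# Toy usage on a quadratic bowl f(x) = 0.5 * x @ A @ x
A = np.diag([1.0, 10.0])
grad_fn = lambda x: A @ x
theta = np.array([3.0, 3.0])
theta = curvature_scaled_step(theta, grad_fn, adam_dir=-grad_fn(theta), lr=0.1)
```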
However, I tried it with different topologies. With just 2 hidden layers it significantly outperforms Adam (I validated this many times), but with 5 or more hidden layers I no longer observe any advantage over Adam.
Now I would like to understand the mechanism behind this. In particular, if Adam needs some overshoot to find the right direction, why is there a difference between 2 and 5 layers of depth?
Does somebody have a hint?
Regards,
Andreas