While working through the optimization algorithms in the Week 2 assignment, I observed that Adam performs the worst of the three algorithms in Section 6.3 - Mini-Batch with Adam, and its cost curve does not make much sense (see the screenshot).
Did anyone observe anything like that?
(My code in update_parameters_with_adam passed the test).
Comments/suggestions would be much appreciated.
That learning rate is quite small. You might get better results if you start with a larger value and reduce it as the epochs increase.
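A minimal sketch of what "start larger and reduce it" could look like, assuming a simple exponential schedule (the function name and `decay_rate` value here are illustrative, not from the notebook):

```python
# Hypothetical exponential learning-rate decay; decay_rate is illustrative.
def decayed_learning_rate(initial_lr, epoch, decay_rate=0.95):
    """Shrink the learning rate geometrically as epochs increase."""
    return initial_lr * (decay_rate ** epoch)

print(decayed_learning_rate(0.01, 0))   # full rate at the start
print(decayed_learning_rate(0.01, 50))  # much smaller later in training
```

You would recompute the rate at the top of each epoch and pass it into the parameter update.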
Are you sure you didn’t alter any of the other parts of the notebook other than the code you need to write? Here’s what I see from the Adam section:
The accuracy I get from Adam is 94%, which is greater than the 71% accuracy I see with plain mini-batch gradient descent or with momentum. Note that 71% is also higher than the accuracy you are getting even with Adam.
If your functions pass the test cases in the notebook, the only thing I can theorize is that you modified some other part of the notebook to create this effect. You might want to start with a clean notebook and just “copy/paste” over your completed code and see if that makes a difference. There is a procedure for that documented on the DLS FAQ Thread.
Thanks to all who responded!
There was a bug in my code that the update_parameters_with_adam test did not catch, because the test initializes v and s with zeros. When I fixed the bug, I got the expected 94% accuracy for Adam, so my original issue is resolved.
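For anyone curious how a zero initialization can mask a bug: here is a small illustration (not my actual bug, just a hypothetical one) where the `beta1 * v` accumulation term is dropped from the moving average. With `v = 0` the buggy and correct versions produce the same first update, so a single-step test starting from zeros passes:

```python
import numpy as np

beta1 = 0.9
g = np.array([1.0, -2.0])  # illustrative gradient

# Correct exponentially weighted average of the gradient
def v_correct(v, g):
    return beta1 * v + (1 - beta1) * g

# Hypothetical bug: the beta1 * v accumulation term is dropped
def v_buggy(v, g):
    return (1 - beta1) * g

v0 = np.zeros(2)
# On the first step (v = 0) both versions agree, so a test that
# initializes v with zeros cannot tell them apart...
print(np.allclose(v_correct(v0, g), v_buggy(v0, g)))  # True

# ...but they diverge as soon as v is nonzero.
v1 = v_correct(v0, g)
print(np.allclose(v_correct(v1, g), v_buggy(v1, g)))  # False
```

The same reasoning applies to `s`; a more robust test would seed v and s with nonzero values.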
As a side note, I've also found that one can get the same >94% accuracy with plain gradient descent by increasing the learning rate; as it turns out, the value in the notebook is too small. Moreover, learning rates for gradient descent and for Adam operate on different scales, so a comparison with numerically identical (and untuned) learning rates does not really demonstrate the advantage of one method over the other.
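A quick sketch of why the scales differ, assuming a constant gradient and ignoring bias correction (the learning rate and gradient values are illustrative): gradient descent's step is `lr * g`, so it scales with the gradient, while Adam's step is roughly `lr * m / (sqrt(v) + eps)`, which normalizes each parameter's step to about `lr` regardless of gradient magnitude.

```python
import numpy as np

lr = 0.0007                 # a small rate, illustrative
g = np.array([0.05, 5.0])   # two parameters with very different gradient scales
eps = 1e-8

# Plain gradient descent: step size scales directly with the gradient
gd_step = lr * g

# Adam at steady state for a constant gradient: m -> g, v -> g**2,
# so the step magnitude is roughly lr for every parameter
m, v = g, g ** 2
adam_step = lr * m / (np.sqrt(v) + eps)

print(gd_step)    # steps differ by a factor of 100 across the two parameters
print(adam_step)  # both steps are roughly lr in magnitude
```

So the same numeric learning rate produces very different effective step sizes under the two methods, which is why tuning each one separately is the fairer comparison.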