While practicing with the optimization algorithms in Week 2's assignment, I noticed that Adam performs the worst of the three algorithms in Section 6.3 - Mini-Batch with Adam, and the cost curve it produces does not make much sense (see the screenshot).
Did anyone observe anything like that?
(My code in update_parameters_with_adam passed the test).
The accuracy I get from Adam is 94%, which is greater than the 71% I see with plain mini-batch gradient descent or with momentum. Note that 71% is also higher than the accuracy you are getting even from Adam.
If your functions pass the test cases in the notebook, my only theory is that you modified some other part of the notebook to create this effect. You might want to start with a clean notebook, copy/paste over just your completed code, and see if that makes a difference. There is a procedure for that documented on the DLS FAQ Thread.
It turns out there was a bug in my code that was not caught by the update_parameters_with_adam test, because the test initializes v and s with zeros. Once I fixed it, I got the expected 94% accuracy with Adam, so my original issue is resolved.
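To illustrate the kind of thing that can slip through (this is just a hypothetical example, not necessarily the exact mistake I made): if the second-moment update drops the beta2 decay on the old s, a test that starts from s = 0 cannot tell the buggy and correct versions apart, because beta2 * 0 and 1 * 0 are the same thing. They only diverge on later updates:

```python
import numpy as np

def adam_moments(grad, v, s, beta1=0.9, beta2=0.999, buggy=False):
    """Update Adam's first/second moment estimates for one gradient.

    The buggy variant forgets the beta2 decay on the old second moment --
    a hypothetical slip that a check starting from s = 0 cannot detect.
    """
    v = beta1 * v + (1 - beta1) * grad
    if buggy:
        s = s + (1 - beta2) * grad**2           # missing the beta2 * s decay
    else:
        s = beta2 * s + (1 - beta2) * grad**2
    return v, s

grad = np.array([0.5, -1.0])
v_ok, s_ok = np.zeros(2), np.zeros(2)
v_bad, s_bad = np.zeros(2), np.zeros(2)

for t in range(1, 4):
    v_ok, s_ok = adam_moments(grad, v_ok, s_ok)
    v_bad, s_bad = adam_moments(grad, v_bad, s_bad, buggy=True)
    print(t, np.allclose(s_ok, s_bad))   # True at t=1, False afterwards
```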
As a side note, I've also found that one can reach the same >94% accuracy with plain mini-batch gradient descent by increasing the learning rate; as it turns out, the value in the notebook is too small. Moreover, learning rates in plain gradient descent and in Adam operate on different scales, so comparing the methods with numerically identical (and untuned) learning rates does not really demonstrate the advantage of one over the other.
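Here is a small back-of-the-envelope sketch of what I mean by different scales (the numbers are made up for illustration): with plain gradient descent the step is learning_rate * gradient, so a small gradient means a tiny step, while Adam divides by the square root of the second-moment estimate, so its per-parameter step is roughly the learning rate itself regardless of the gradient's magnitude.

```python
import numpy as np

# Rough comparison of effective step sizes, assuming a constant gradient.
lr = 7e-4              # a small learning rate, for illustration
grad = 0.05            # a modest, constant gradient component
beta1, beta2, eps = 0.9, 0.999, 1e-8

# Plain gradient descent: step scales with the gradient magnitude.
gd_step = lr * grad

# Adam: with a constant gradient, the bias-corrected moments settle at
# grad and grad**2, so the update is roughly lr * sign(grad).
v = s = 0.0
for t in range(1, 51):
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad**2
    v_hat = v / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
adam_step = lr * v_hat / (np.sqrt(s_hat) + eps)

print(f"GD step:   {gd_step:.2e}")    # ~3.5e-05, scales with |grad|
print(f"Adam step: {adam_step:.2e}")  # ~7.0e-04, roughly lr itself
```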