Hi all,
Just making a note here that, as far as I can see, was missing from the otherwise amazing explanations in the assignment's notebook:
If anyone was wondering why all three algorithms work so much better once we introduce learning rate decay, including the one with no optimization and the one with momentum (both of which didn't do so well before):
It is because of the higher initial learning rate they started with (0.1 instead of the previous 0.0007).
Such a large initial value only works because the rate keeps decaying during training, so the steps eventually become small enough for the process to converge.
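To make the idea concrete, here is a minimal sketch of one common decay schedule (inverse-time decay); the function and parameter names are just illustrative, not necessarily the exact ones used in the notebook:

```python
import numpy as np

def decayed_lr(learning_rate0, epoch_num, decay_rate):
    # Inverse-time decay: the learning rate shrinks as epochs go by,
    # so you can start with a large value (e.g. 0.1) and still end up
    # taking small steps that let the parameters settle into a minimum.
    return learning_rate0 / (1 + decay_rate * epoch_num)

# Example: start at 0.1 and watch the rate shrink over training
lr0, decay_rate = 0.1, 1.0
for epoch in [0, 1, 10, 100, 1000]:
    print(epoch, decayed_lr(lr0, epoch, decay_rate))
```

Early on the steps are big (fast progress), and later they get tiny (no more overshooting around the minimum), which is why even plain gradient descent benefits so much from it.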