W2 Assignment - How learning rate decay does such a great job

Hi all,

Just making a note here on something that, as far as I can see, was missing from the otherwise amazing explanations in the assignment’s notebook:

If anyone was wondering how all three algorithms work so much better when we introduce learning rate decay, including the one with no optimization and the one with momentum (both of which did not do so well before):

It is because of the much higher learning rate they start with (0.1 instead of the previous 0.0007).

It is possible to work with such a large initial value because it decays over the epochs, so the process can still converge in the end.

By the way, we can see in the graph that the learning rate at the end of the process is still much larger than the constant one used in the earlier experiments without decay (0.02 instead of 0.0007).
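To make this concrete, here is a minimal sketch of an inverse-time decay schedule like the one in the assignment. The exact `decay_rate` and any epoch-interval grouping are assumptions here, not the assignment's actual values; the point is just how a 0.1 starting rate can shrink toward 0.02 while still staying far above 0.0007:

```python
def decayed_lr(lr0, decay_rate, epoch):
    """Inverse-time decay: the rate shrinks as epochs accumulate."""
    return lr0 / (1 + decay_rate * epoch)

lr0 = 0.1          # large initial learning rate
decay_rate = 1.0   # hypothetical value, chosen for illustration

for epoch in [0, 1, 4, 9]:
    print(epoch, decayed_lr(lr0, decay_rate, epoch))
# epoch 0 -> 0.1, epoch 1 -> 0.05, epoch 4 -> 0.02, epoch 9 -> 0.01
```

With these illustrative numbers, the rate passes through 0.02 after a few epochs, which matches the order of magnitude seen at the end of the decay plot.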

Running the original experiment (no decay, no optimization) with a constant learning rate of 0.02 produces the following result:

Not so bad either (accuracy is actually a bit higher than with a high initial rate that decays over time).

Hey @Tal_Alon,
Thanks a lot for sharing your insights with the community.