Momentum clarification

Hi Sir,

The statements below are taken from the programming assignment. Could you please help clarify what they mean?

Statement 1: Momentum usually helps, but with the given small learning rate & smaller dataset, its impact is negligible. Why is the impact negligible?

Statement 2: Usually works well even with little tuning of hyperparameters (except α). Does this mean that tuning alpha is not necessary for Adam?

Hi Anbu,

I’m going to make some assumptions because I’m not sure of the full context of these statements (i.e. which lessons in particular they come from).

On statement #1: momentum may have a negligible impact on small datasets, or when the learning rate is very small. I believe you can look at it this way: momentum helps you converge faster by adding a boost to each gradient step, built up from the previous steps. However, if the gradient steps are tiny, the boost will be tiny too, hence the “negligible” in the statement.
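
If it helps, here is a minimal sketch of another way to see the same thing. It assumes the exponentially-weighted-average form of momentum with bias correction (the assignment's exact implementation may differ), a toy loss f(w) = 0.5 * w**2 that I made up, and a hypothetical helper `max_gap`. Momentum steps along an average of recent gradients; when the learning rate is small, the weights barely move between updates, so that average is almost the same as the current gradient and the update looks like plain gradient descent.

```python
def max_gap(lr, beta=0.9, steps=20, w0=1.0):
    """Largest gap between the momentum average and the raw gradient.

    Toy loss f(w) = 0.5 * w**2, so the gradient at w is simply w.
    The first gradient is 1.0, so the gap reads as a fraction of it.
    """
    w, v, worst = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = w                                # gradient of the toy loss
        v = beta * v + (1.0 - beta) * g      # exponentially weighted average
        v_hat = v / (1.0 - beta ** t)        # bias correction for the zero start
        worst = max(worst, abs(v_hat - g))   # how far momentum is from plain GD
        w -= lr * v_hat                      # momentum update
    return worst


small = max_gap(lr=0.001)
large = max_gap(lr=0.1)
print(f"lr=0.001: largest gap between momentum and raw gradient = {small:.4f}")
print(f"lr=0.1:   largest gap between momentum and raw gradient = {large:.4f}")
```

With the small learning rate the two directions stay within a couple of percent of each other over these 20 updates, while with the larger one they drift apart noticeably, which is where momentum starts to change the result.
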
Check this article that I find interesting:

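On statement #2: this just refers to Adam optimization being very effective out of the box. Adam uses some clever tricks to choose a good step size on a per-parameter basis (i.e. for each weight), based on running averages of that parameter's gradient and squared gradient, instead of relying only on one static learning rate such as alpha. Note the “except α” in the statement though: alpha is still the one hyperparameter you would usually tune, while the other Adam hyperparameters tend to work fine at their default values.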
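
To make the per-parameter part concrete, here is a rough sketch of a single Adam update in plain Python. The helper `adam_step`, the `state` dictionary and the two made-up gradients are only for illustration, and the defaults (beta1=0.9, beta2=0.999, eps=1e-8) are the ones suggested in the Adam paper. It compares the step Adam takes with the step plain gradient descent would take for two weights whose gradients differ by a factor of 10,000.

```python
import math


def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place and return the step taken per parameter."""
    state["t"] += 1
    t = state["t"]
    steps = []
    for i, g in enumerate(grads):
        # Running averages of the gradient and of the squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction, because both averages start at zero.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # The step is scaled separately for each parameter.
        step = lr * m_hat / (math.sqrt(v_hat) + eps)
        params[i] -= step
        steps.append(step)
    return steps


params = [1.0, 1.0]
grads = [100.0, 0.01]  # made-up gradients on very different scales
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}

adam_steps = adam_step(params, grads, state)
sgd_steps = [0.001 * g for g in grads]  # plain gradient descent, same alpha

print("Adam steps:", adam_steps)
print("SGD steps: ", sgd_steps)
```

On this first update Adam moves both weights by roughly alpha, whatever the size of their gradients, while plain gradient descent moves them by wildly different amounts. That is also why alpha is still the knob worth tuning: it sets the overall scale of every step.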

The paper is not super long and is pretty interesting:

Hope that helps.

