DL week2: Gradient Descent with Momentum


Can anyone tell me why a larger \beta leads to less oscillation on the path to the minimum? Thanks! :slight_smile:

Welcome to the community, @Jiacheng_Cao!

With a higher \beta, previous gradients are weighted more heavily, which makes your gradient estimate smoother. Imagine you have some noise or a gradient outlier: a higher \beta helps to mitigate its impact because the previous gradients carry more weight. You can think of it as an exponentially weighted moving average, which smooths the gradient and therefore reduces oscillations.

Note that if you increase \beta too far, you might have too much momentum and overshoot the optimum you are trying to reach.
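To make this concrete, here is a minimal sketch (my own illustration, not from the course materials) that runs momentum on a simple quadratic f(w) = w² with artificially noisy gradients, and measures how much the path oscillates for different \beta values. All function names and the noise scale are my choices for illustration.

```python
import numpy as np

def momentum_path(beta, lr=0.1, steps=200, seed=0):
    """Minimize f(w) = w**2 with noisy gradients using momentum."""
    rng = np.random.default_rng(seed)
    w, v = 5.0, 0.0
    path = [w]
    for _ in range(steps):
        grad = 2 * w + rng.normal(scale=4.0)  # true gradient 2w plus noise
        v = beta * v + (1 - beta) * grad      # exponentially weighted average
        w -= lr * v                           # parameter update
        path.append(w)
    return np.array(path)

def total_variation(path):
    """Sum of step sizes along the path: larger means more oscillation."""
    return np.abs(np.diff(path)).sum()
```

With the same noise sequence, you should find that the \beta = 0.9 path has a noticeably smaller total variation than the \beta = 0.1 path, which is exactly the smoothing effect described above.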

Bear in mind that other gradient-based optimization algorithms may be more suitable depending on your data and the shape of the cost surface:
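One widely used example of such an algorithm is Adam, which combines momentum with per-parameter scaling of the step size. Below is a minimal sketch of a single Adam update step following the standard formulation (Kingma & Ba); the function name and default hyperparameters here are my own illustrative choices.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus adaptive scaling by grad magnitude (v)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Running this in a loop on f(w) = w² steadily drives w toward 0, and the v term keeps the effective step size roughly bounded regardless of the raw gradient magnitude.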


Best regards

@Jiacheng_Cao, I moved your question from general category to the DL section.

Please let me know if your question is answered or if anything is open.

Best regards