Hi @donnie1123,
I’ll take a stab at answering your questions.
- Check out the explanation for the Adam section (exercise 5) in the notebook.
This is the calculation of ‘s’:
“It calculates an exponentially weighted average of the squares of the past gradients, and stores it [...]”
In the Adam paper, I think this is referred to as the “biased second raw moment estimate”, and it is part of the Adam algorithm (see the short sketch after this list).
- Why is the Adam accuracy lower than momentum with a decay rate?
I think it’s just because SGD with momentum + learning rate decay is a better fit than Adam for this particular dataset. All these optimizations are (if I may) kind of black magic, and people are still trying to figure out why they work. For instance, although there are some theories and intuition around why learning rate decay works, there are still papers trying to explain it.
A good takeaway from this is that you should try different things and see what works best for your data. Adam arguably doesn’t need much tuning, so people like that, while others like to tweak things, find lrDecay interesting, and get good results. In the end, this is up to you.
- If I understand your question correctly, the ‘lower than Adam’ remark refers to the results of exercise 6, in which lrDecay is not used, and thus SGD and SGD+momentum perform far worse than Adam.
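To make the first point concrete, here is a minimal NumPy sketch of that second-moment update (the function and variable names are mine, not the notebook’s exact code):

```python
import numpy as np

def update_second_moment(s, grad, beta2=0.999):
    # Exponentially weighted average of the squared past gradients:
    # the "biased second raw moment estimate" from the Adam paper.
    return beta2 * s + (1 - beta2) * np.square(grad)

# Toy usage: s drifts toward a running average of grad**2.
s = np.zeros(3)
for t in range(1, 6):
    grad = np.array([0.1, -0.2, 0.3])
    s = update_second_moment(s, grad)
    s_corrected = s / (1 - 0.999 ** t)  # Adam's bias correction for early steps
```

Adam then divides the (bias-corrected) momentum term by the square root of this s, which is what adapts the effective step size per parameter.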
Hope that helps!
Thanks for your reply. I understand the first two questions now. But I think in exercise 6 we use mini-batch with size 64 instead of SGD. Again, thanks for your time and effort.
Oh, I see the confusion.
In these exercises you are actually doing SGD, SGD+momentum, and Adam, all of them over mini-batches of 64. The idea here is to demonstrate the differences between training with different optimization techniques.
The mini-batch is just an efficient way to train on large datasets by partitioning them into batches, as we learned in the previous course!
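To illustrate what I mean, here is a rough, self-contained sketch (my own toy example, not the notebook’s code) where plain gradient descent and momentum each take one step per mini-batch of 64; Adam would fit the same loop, only the update rule changes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1024)

def batch_grad(w, Xb, yb):
    # Gradient of mean squared error on a single mini-batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w_gd = np.zeros(3)                    # plain mini-batch gradient descent
w_mom, v = np.zeros(3), np.zeros(3)   # mini-batch + momentum
lr, beta, batch_size = 0.05, 0.9, 64

for epoch in range(10):
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        w_gd -= lr * batch_grad(w_gd, Xb, yb)
        v = beta * v + (1 - beta) * batch_grad(w_mom, Xb, yb)  # running average of gradients
        w_mom -= lr * v
```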
In my opinion, SGD is just a special case of mini-batch gradient descent where m = 1, but they are not equivalent in practice, since SGD does not use vectorisation within each batch/sample. So in exercise 6, I think we are actually doing mini-batch / mini-batch + momentum. That is quite different from SGD / SGD + momentum because of the difference in vectorisation. Thanks for your reply. Or maybe I’m just getting tangled up in definitions. Thanks for your patience.
Hey @donnie1123,
Your point just flew over my head 
Yes, I believe you are right. I think the confusion comes from referring to gradient descent in general as SGD. In these exercises it would be more accurate to say mini-batch gradient descent and mini-batch + momentum instead of SGD.
I’ll request a change to avoid future confusion. Thanks for bringing this up and taking the time to help me understand the issue.
Cheers,
You’re right that Stochastic Gradient Descent (SGD) is equivalent to mini-batch with a batch size of 1. But the point about vectorization is purely a performance question, right? The real point is that the smaller the batch size, the larger the statistical noise in the parameter updates. In the limit of m = 1, you have the noisiest possible updates, but the point of momentum is to compensate for that. All the evaluations and qualitative statements here are about the accuracy of the result, not the performance (CPU time / memory / wall clock time required to train the model), so vectorized or not is beside the point. You are free to write your mini-batch code in a non-vectorized way, if you’re feeling masochistic today. The mathematical results will be the same (modulo some possible rounding differences); it will just take a lot longer.
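To back up the “same math, different speed” point, here is a quick toy check (my own example, not from the notebook) that the vectorized and per-example mini-batch gradients agree up to rounding:

```python
import numpy as np

rng = np.random.default_rng(1)
Xb = rng.normal(size=(64, 3))   # one mini-batch of 64 examples
yb = rng.normal(size=64)
w = rng.normal(size=3)

# Vectorized mini-batch gradient of mean squared error.
grad_vec = 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Same gradient, computed one example at a time (no vectorization).
grad_loop = np.zeros_like(w)
for x_i, y_i in zip(Xb, yb):
    grad_loop += 2 * x_i * (x_i @ w - y_i)
grad_loop /= len(yb)

print(np.allclose(grad_vec, grad_loop))   # True: identical math, just slower in the loop
```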
Thanks, I found that I misunderstood these two things. So in conclusion, the only difference between SGD and mini-batch is the batch size, and here we should say it is mini-batch. Thanks for your clarification.