Hi @donnie1123,
I’ll take a stab at answering your questions.
- Check out the explanation for the Adam section (exercise 5) in the notebook.
This is the calculation of ‘s’:
“It calculates an exponentially weighted average of the squares of the past gradients, and stores it [...]”
In the Adam paper, I think this is referred to as the “biased second raw moment estimate”, and it is part of the Adam algorithm (see the short sketch after this list).
- Why is the Adam accuracy lower than momentum with a decay rate?
I think it’s just because SGD with momentum + learning rate decay is a better fit than Adam for this particular dataset. All these optimizations are (if I may) kind of black magic, and people are still trying to figure out why they work. For instance, although there are some theories and intuition around why learning rate decay works, there are still papers trying to explain it.
A good takeaway from this is that you should try different things and see what works best for your data. Adam arguably doesn’t need much tuning, so people like that, while others like to tweak things, find lrDecay interesting, and get good results. In the end, this is up to you.
- If I understand your question correctly, the ‘lower than Adam’ remark refers to the results of exercise 6, in which lrDecay is not used, and thus SGD and SGD+momentum perform far worse than Adam.
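To make the first point concrete, here is a minimal NumPy sketch of that second-moment update (the function and variable names are mine, not the notebook’s exact code):

```python
import numpy as np

def update_second_moment(s, grad, beta2=0.999):
    # Exponentially weighted average of the squared past gradients:
    # the "biased second raw moment estimate" from the Adam paper.
    return beta2 * s + (1 - beta2) * np.square(grad)

# Toy usage: s drifts toward a running average of grad**2.
s = np.zeros(3)
for t in range(1, 6):
    grad = np.array([0.1, -0.2, 0.3])
    s = update_second_moment(s, grad)
    s_corrected = s / (1 - 0.999 ** t)  # Adam's bias correction for early steps
```

Adam then divides the (bias-corrected) momentum term by the square root of this s, which is what adapts the effective step size per parameter.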
Hope that helps!
Thanks for your reply. I understand the first two questions now. But I think in exercise 6 we use mini-batch with size 64 instead of SGD. Again, thanks for your time and effort.
Oh, I see the confusion.
In these exercises you are actually doing SGD, SGD+momentum, and Adam, all of them over mini-batches of 64. The idea here is to demonstrate the differences between training with different optimization techniques.
The mini-batch is just an efficient way to train on large datasets by partitioning them into batches, as we learned in the previous course!
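To illustrate what I mean, here is a rough, self-contained sketch (my own toy example, not the notebook’s code) where plain gradient descent and momentum each take one step per mini-batch of 64; Adam would fit the same loop, only the update rule changes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1024)

def batch_grad(w, Xb, yb):
    # Gradient of mean squared error on a single mini-batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w_gd = np.zeros(3)                    # plain mini-batch gradient descent
w_mom, v = np.zeros(3), np.zeros(3)   # mini-batch + momentum
lr, beta, batch_size = 0.05, 0.9, 64

for epoch in range(10):
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        w_gd -= lr * batch_grad(w_gd, Xb, yb)
        v = beta * v + (1 - beta) * batch_grad(w_mom, Xb, yb)  # running average of gradients
        w_mom -= lr * v
```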
In my opinion, SGD is just a special case of mini-batch gradient descent where m = 1, but they are not equivalent in practice, since SGD does not use vectorisation within each batch/sample. So in exercise 6, I think we are actually doing mini-batch / mini-batch + momentum. That is quite different from SGD / SGD + momentum because of the difference in vectorisation. Thanks for your reply. Or maybe I’m just getting tangled up in definitions. Thanks for your patience.
Hey @donnie1123,
Your point just flew over my head 
Yes, I believe you are right. I think the confusion comes from referring to gradient descent in general as SGD. In these exercises it would be more accurate to say mini-batch gradient descent and mini-batch + momentum instead of SGD.
I’ll request a change to avoid future confusion. Thanks for bringing this up and taking the time to help me understand the issue.
Cheers,
You’re right that Stochastic Gradient Descent (SGD) is equivalent to mini-batch with a batch size of 1. But the point about vectorization is purely a performance question, right? The real point is that the smaller the batch size, the larger the statistical noise in the parameter updates. In the limit of m = 1, you have the noisiest possible updates, but the point of momentum is to compensate for that. All the evaluations and qualitative statements here are about the accuracy of the result, not the performance (CPU time / memory / wall clock time required to train the model), so vectorized or not is beside the point. You are free to write your mini-batch code in a non-vectorized way, if you’re feeling masochistic today. The mathematical results will be the same (modulo some possible rounding differences); it will just take a lot longer.
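To back up the “same math, different speed” point, here is a quick toy check (my own example, not from the notebook) that the vectorized and per-example mini-batch gradients agree up to rounding:

```python
import numpy as np

rng = np.random.default_rng(1)
Xb = rng.normal(size=(64, 3))   # one mini-batch of 64 examples
yb = rng.normal(size=64)
w = rng.normal(size=3)

# Vectorized mini-batch gradient of mean squared error.
grad_vec = 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Same gradient, computed one example at a time (no vectorization).
grad_loop = np.zeros_like(w)
for x_i, y_i in zip(Xb, yb):
    grad_loop += 2 * x_i * (x_i @ w - y_i)
grad_loop /= len(yb)

print(np.allclose(grad_vec, grad_loop))   # True: identical math, just slower in the loop
```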
Thanks, I found that I misunderstood these two things. So in conclusion, the only difference between SGD and mini-batch is the batch size, and here we should say it is mini-batch. Thanks for your clarification.