Stochastic Gradient Descent Definition

Hello all,

Please correct me if I'm wrong. From the course's videos, I learned that Stochastic Gradient Descent (SGD) is gradient descent with a mini-batch size of one; in other words, the distinction is about how many training examples are used for each parameter update. Its alternatives are mini-batch gradient descent and full-batch gradient descent.
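To check that I'm picturing it right, here is a rough sketch of a single gradient step the way I understand it; the toy linear model, the squared-error loss, and all the names below are made up purely for illustration:

```python
import numpy as np

def gradient_step(w, X_batch, y_batch, lr=0.01):
    """One plain gradient-descent update on whatever batch is passed in."""
    y_hat = X_batch @ w                                       # toy linear model
    grad = 2 * X_batch.T @ (y_hat - y_batch) / len(y_batch)   # gradient of mean squared error
    return w - lr * grad

# - batch of 1 example       -> "stochastic" gradient descent (as defined in the course)
# - batch of, say, 64        -> mini-batch gradient descent
# - batch of all N examples  -> (full-)batch gradient descent
```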

However, after playing around with TensorFlow, I see that SGD is categorized as an optimizer.
I also see multiple articles comparing it to Adam, as though my initial understanding of SGD were incomplete. In TensorFlow, I thought I just needed to set the batch size to 1 to apply SGD, so why is there an optimizer called SGD? According to my understanding, we could use SGD (batch size = 1) and Adam (as the optimizer) at the same time, yet what I've read so far implies that one needs to choose between SGD or Adam for their neural network model.
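Concretely, this is the kind of setup I had in mind: Adam chosen as the optimizer, and the batch size set to 1 separately. The model and data below are just dummies to illustrate the question:

```python
import numpy as np
import tensorflow as tf

# dummy data and model, just to show the two separate settings
x_train = np.random.rand(100, 10).astype("float32")
y_train = np.random.rand(100, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.Input(shape=(10,)), tf.keras.layers.Dense(1)])

# Adam chosen here as the optimizer (the parameter-update rule) ...
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")

# ... while the batch size is set to 1 in fit()
model.fit(x_train, y_train, batch_size=1, epochs=1, verbose=0)
```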

I would appreciate it if anyone could point me in the right direction on this.

The DLS specialization does a pretty good job of explaining these concepts; maybe you just need a bit more time with it.

SGD and Adam are both variants of gradient descent, which TensorFlow calls optimizers. They help the model converge towards an optimum by computing the gradients of the loss function and taking steps that minimize it.
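To make that a bit more concrete, here is a heavily simplified sketch of the two update rules; this is not the actual TensorFlow implementation, just the idea that both take a gradient and turn it into a parameter step:

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    """Plain gradient descent: step directly against the gradient."""
    return w - lr * grad

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: same idea, but the step size is adapted per parameter using
    running averages of the gradient (m) and of its square (v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction; t is the step count starting at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

In tf.keras you pass one or the other to model.compile; either way, the gradient it receives is computed on whatever mini-batch size you chose in model.fit.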

You may read this post; it has a lot of info regarding these topics.
Gradient Descents

Thanks for your reply and the link to the article. Just FYI, I have already finished the course, but when trying to deepen my understanding, I stumbled upon this question.

Reading the article, it seems to confirm that my understanding of SGD and Adam from the course is correct. But my initial questions remain unanswered.

In TensorFlow, I thought I just needed to set the batch size to 1 to apply SGD, so why is there an optimizer called SGD? It implies that I can have a batch size of 16 yet still use SGD as the optimizer. How could that be? What is the difference between setting the batch size to 1 versus using the SGD optimizer in TensorFlow? I also see articles comparing SGD with Adam. According to my understanding, we could use SGD (batch size = 1) and Adam (as the optimizer) at the same time for training, yet what I've read so far implies that one needs to choose between SGD or Adam for their neural network model.
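For example, this is the combination I can't reconcile with my original definition, because TensorFlow accepts both settings together without complaint (again, a dummy model and data, purely for illustration):

```python
import numpy as np
import tensorflow as tf

# dummy data and model, only to show the two settings living side by side
x_train = np.random.rand(100, 10).astype("float32")
y_train = np.random.rand(100, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.Input(shape=(10,)), tf.keras.layers.Dense(1)])

# SGD chosen as the optimizer (the update rule) ...
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")

# ... while the batch size is 16, not 1
model.fit(x_train, y_train, batch_size=16, epochs=1, verbose=0)
```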