Stochastic Gradient Descent Definition

Hello all,

Please correct me if I'm wrong. From the course's videos, I learned that Stochastic Gradient Descent (SGD) is gradient descent with a mini-batch size of one; in other words, the distinction is about how many training examples are used for each parameter update. Its alternatives are mini-batch gradient descent and full-batch gradient descent.
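To check that I'm picturing it right, here is a rough sketch of a single gradient step the way I understand it; the toy linear model, the squared-error loss, and all the names below are made up purely for illustration:

```python
import numpy as np

def gradient_step(w, X_batch, y_batch, lr=0.01):
    """One plain gradient-descent update on whatever batch is passed in."""
    y_hat = X_batch @ w                                       # toy linear model
    grad = 2 * X_batch.T @ (y_hat - y_batch) / len(y_batch)   # gradient of mean squared error
    return w - lr * grad

# - batch of 1 example       -> "stochastic" gradient descent (as defined in the course)
# - batch of, say, 64        -> mini-batch gradient descent
# - batch of all N examples  -> (full-)batch gradient descent
```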

However, after playing around with TensorFlow, I see that SGD is categorized as an optimizer.
I also see multiple articles comparing it to Adam, as though my initial understanding of SGD were incomplete. In TensorFlow, I thought I just needed to set the batch size to 1 to apply SGD, so why is there an optimizer called SGD? According to my understanding, we could use SGD (batch size = 1) and Adam (as the optimizer) at the same time, yet what I've read so far implies that one needs to choose between SGD or Adam for their neural network model.
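Concretely, this is the kind of setup I had in mind: Adam chosen as the optimizer, and the batch size set to 1 separately. The model and data below are just dummies to illustrate the question:

```python
import numpy as np
import tensorflow as tf

# dummy data and model, just to show the two separate settings
x_train = np.random.rand(100, 10).astype("float32")
y_train = np.random.rand(100, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.Input(shape=(10,)), tf.keras.layers.Dense(1)])

# Adam chosen here as the optimizer (the parameter-update rule) ...
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")

# ... while the batch size is set to 1 in fit()
model.fit(x_train, y_train, batch_size=1, epochs=1, verbose=0)
```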

I would appreciate it if anyone could point me in the right direction on this.

The DLS specialization does a pretty good job of explaining these concepts; maybe you just need a bit more time with it.

SGD and Adam are both variants of gradient descent, which TensorFlow calls optimizers. They help the model converge towards an optimum by computing the gradients of the loss function and taking steps that minimize it.
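To make that a bit more concrete, here is a heavily simplified sketch of the two update rules; this is not the actual TensorFlow implementation, just the idea that both take a gradient and turn it into a parameter step:

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    """Plain gradient descent: step directly against the gradient."""
    return w - lr * grad

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: same idea, but the step size is adapted per parameter using
    running averages of the gradient (m) and of its square (v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction; t is the step count starting at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

In tf.keras you pass one or the other to model.compile; either way, the gradient it receives is computed on whatever mini-batch size you chose in model.fit.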

You may read this post; it has a lot of info regarding these topics.
Gradient Descents

Thanks for your reply and the link to the article. Just FYI, I have already finished the course, but when trying to deepen my understanding, I stumbled upon this question.

Reading the article, it seems to confirm that my understanding of SGD and Adam from the course is correct. But my initial questions remain unanswered.

In TensorFlow, I thought I just needed to set the batch size to 1 to apply SGD, so why is there an optimizer called SGD? It implies that I can have a batch size of 16 yet still use SGD as the optimizer. How could that be? What is the difference between setting the batch size to 1 versus using the SGD optimizer in TensorFlow? I also see articles comparing SGD with Adam. According to my understanding, we could use SGD (batch size = 1) and Adam (as the optimizer) at the same time for training, yet what I've read so far implies that one needs to choose between SGD or Adam for their neural network model.
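For example, this is the combination I can't reconcile with my original definition, because TensorFlow accepts both settings together without complaint (again, a dummy model and data, purely for illustration):

```python
import numpy as np
import tensorflow as tf

# dummy data and model, only to show the two settings living side by side
x_train = np.random.rand(100, 10).astype("float32")
y_train = np.random.rand(100, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.Input(shape=(10,)), tf.keras.layers.Dense(1)])

# SGD chosen as the optimizer (the update rule) ...
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")

# ... while the batch size is 16, not 1
model.fit(x_train, y_train, batch_size=16, epochs=1, verbose=0)
```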