SGD vs GD in frameworks

In TensorFlow they use the name Stochastic Gradient Descent instead of Gradient Descent. So does TF update the parameters using one example at a time even though I make batch_size = 1? (This divides the data into only one part, so the whole m examples get trained in every epoch.)

If you configure it to supply m samples per batch, then each update step is trained on all m samples. The optimizer's name doesn't override that configuration.
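To make that concrete, here is a minimal NumPy sketch (illustrative only, not framework code) showing that the update rule is identical in batch and stochastic gradient descent; the only difference is how many samples are fed into each update:

```python
import numpy as np

def gradient_step(w, X, y, lr=0.1):
    """One gradient descent update for linear regression (MSE loss),
    computed on whatever batch (X, y) is supplied."""
    m = len(y)
    grad = (2.0 / m) * X.T @ (X @ w - y)  # MSE gradient over this batch
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # m = 8 examples, 3 features
y = rng.normal(size=8)
w0 = np.zeros(3)

# "Batch" gradient descent: one update per epoch, using all m examples.
w_batch = gradient_step(w0, X, y)

# "Stochastic" gradient descent: m updates per epoch, one example each.
w_sgd = w0.copy()
for i in range(len(y)):
    w_sgd = gradient_step(w_sgd, X[i:i+1], y[i:i+1])

# Same update function either way; only the batch handed to it differs.
```

In a framework, the batch size you pass to the training loop (e.g. `fit`) plays the role of the slicing above; the optimizer object never decides it.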

Note the definition of batch_size (source):


What I’m actually wondering is: when I look into the documentation of the Keras SGD optimizer, it says it is gradient descent. So why does every framework (TF, Keras, PyTorch) choose to call it stochastic gradient descent? Because when I give it all m examples, it will be plain gradient descent.

Hello @alperenunlu,

Do you realize they never use the term “stochastic” in that document? Not even once.

Instead, they call it “Gradient descent (with momentum) optimizer”.

I can’t defend their decision to use that name, and I don’t know how they chose it, but one thing is for sure: the SGD object doesn’t control the batch size at all. No matter what name you give it, whether stochastic gradient descent, mini-batch gradient descent, or batch gradient descent, it will simply take whatever amount of data you give it and perform the gradient descent update on that.
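To illustrate the point, here is a toy stand-in for an SGD-style optimizer (a sketch, not the actual Keras implementation): it only applies the update rule to whatever gradient it receives, and nothing in it knows or cares how many samples produced that gradient.

```python
import numpy as np

class PlainSGD:
    """Toy gradient-descent-with-momentum optimizer (illustrative only,
    not Keras's SGD). Note there is nothing stochastic in here: it just
    applies the update params += momentum * v - lr * grads."""
    def __init__(self, learning_rate=0.01, momentum=0.0):
        self.lr = learning_rate
        self.momentum = momentum
        self.velocity = None

    def apply_gradients(self, grads, params):
        if self.velocity is None:
            self.velocity = np.zeros_like(params)
        # The optimizer never sees the batch; `grads` may have been
        # averaged over 1 example or over all m of them.
        self.velocity = self.momentum * self.velocity - self.lr * grads
        return params + self.velocity

opt = PlainSGD(learning_rate=0.1)
params = np.array([1.0, -2.0])
grads = np.array([0.5, 0.5])   # could come from 1 sample or from all m
params = opt.apply_gradients(grads, params)
# params is now [0.95, -2.05]
```

Whether this counts as stochastic, mini-batch, or batch gradient descent was decided upstream, by whoever computed `grads`.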
