SGD vs GD in frameworks

If you configure the model to receive m samples per batch, then each update step is computed on those m samples. The optimizer's name doesn't override that configuration.
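
For example, here is a minimal sketch (the model and data are made up for illustration) showing that the `batch_size` argument of `model.fit`, not the optimizer's name, decides how many samples go into each gradient update:

```python
import numpy as np
import tensorflow as tf

# Illustrative toy data: 1000 samples, 20 features.
x = np.random.rand(1000, 20).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")

# "Mini-batch" gradient descent: 32 samples per update step.
model.fit(x, y, batch_size=32, epochs=1)

# "Batch" gradient descent: all 1000 samples in a single update step per epoch,
# even though the optimizer object is still the one named SGD.
model.fit(x, y, batch_size=len(x), epochs=1)
```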

See the definition of batch_size in the documentation (source).

Cheers,
Raymond

What I'm actually wondering is this: when I look into the documentation of Keras SGD, it says it is gradient descent, so why does every framework (TF, Keras, PyTorch) choose to call it stochastic gradient descent? Because when I give it all m examples, it becomes normal (batch) GD.

Hello @alperenunlu,

Did you notice that they don't even use the term “stochastic” in that document? Not even once.

Instead, they call it “Gradient descent (with momentum) optimizer”.

I can't defend their decision to use that name, and I don't know why they named it that way, but one thing is for sure: that SGD object doesn't control the batch size at all. Whatever name you give it, whether stochastic gradient descent, mini-batch gradient descent, or batch gradient descent, it simply performs the gradient descent update using whatever amount of data you feed it.
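
Here is a minimal sketch (the variable names are made up for illustration) of what the optimizer actually does: it only applies whatever gradient you hand it, and whether that gradient came from one sample, a mini-batch, or the full dataset is decided entirely by the data you pass in.

```python
import tensorflow as tf

w = tf.Variable([0.0, 0.0])                       # toy parameters
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

def one_update(x_batch, y_batch):
    # Compute the loss on whatever batch we were given and take one step.
    with tf.GradientTape() as tape:
        preds = tf.reduce_sum(x_batch * w, axis=1)
        loss = tf.reduce_mean(tf.square(preds - y_batch))
    grads = tape.gradient(loss, [w])
    optimizer.apply_gradients(zip(grads, [w]))    # same call for any batch size

x_all = tf.random.normal([1000, 2])
y_all = tf.random.normal([1000])

one_update(x_all[:1], y_all[:1])  # "stochastic": gradient from 1 sample
one_update(x_all[:32], y_all[:32])  # mini-batch: gradient from 32 samples
one_update(x_all, y_all)          # "batch" GD: gradient from all 1000 samples
```

The optimizer code is identical in all three calls; only the data you feed into the gradient computation changes.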

Cheers,
Raymond
