Stochastic Gradient Descent vs. Adam

Hi,

I found a resource online that describes Adam as “an extended version of Stochastic Gradient Descent.” Is this true? I know we talked about stochastic GD in the mini-batch lecture and said that a mini-batch size of 1 corresponds to it.

However, I don’t think we linked that to Adam.

I think you are correct that those two ideas are independent: Adam changes how a gradient estimate is turned into a parameter update (momentum plus a per-parameter adaptive step size), while “stochastic vs. mini-batch” only describes how that gradient estimate is computed. You have to be careful when you just do a Google search for some DL concept: there are a lot of people writing Medium articles that sound sensible, but who don’t really have much expertise and are mainly trying to build up an online body of work they can link to their profiles.
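To make the separation concrete, here are the standard update rules side by side (this is just the textbook Adam formulation, not something specific to the lecture). The batch size only affects how the gradient estimate $g_t$ is computed; the update rule itself never sees it:

$$
\begin{aligned}
g_t &= \nabla_\theta\, \mathcal{L}(\theta_{t-1};\, B_t) && \text{gradient estimate on the current batch } B_t \text{ (any size, including 1)}\\
\text{SGD:}\quad \theta_t &= \theta_{t-1} - \alpha\, g_t\\
\text{Adam:}\quad m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}\\
\hat m_t &= \frac{m_t}{1-\beta_1^{t}}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^{t}}\\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}
\end{aligned}
$$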

Or maybe the author was just trying to say that Adam works well in the SGD case. Of course, that doesn’t mean it isn’t also useful with mini-batch sizes greater than 1, so it may just be a question of interpreting the statement correctly.
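Here is a minimal sketch of that independence in code (assuming a PyTorch setup with a toy linear model, none of which comes from the lecture): the batch size is chosen in the data loader, while SGD vs. Adam is a one-line choice of update rule, so either optimizer works with a batch size of 1 or larger.

```python
# Minimal sketch (assumes PyTorch; the data and model are made up for illustration).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(256, 10)          # toy inputs
y = torch.randn(256, 1)           # toy targets
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

# batch_size=1 is "true" stochastic GD; any larger value is mini-batch GD.
loader = DataLoader(TensorDataset(X, y), batch_size=1, shuffle=True)

# Swapping this one line changes the update rule; the sampling above is untouched.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for xb, yb in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()               # gradient estimate from the current (mini-)batch
    optimizer.step()              # Adam's adaptive update (or SGD's plain step)
```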