In the video “Understanding Mini-batch Gradient Descent”, Prof. Andrew said that “stochastic gradient descent won’t ever converge, it’ll always just kind of oscillate and wander around the region of the minimum.” But in the Week 2 assignment we used it as an optimization method.
I’ve attached a screenshot from the video where Andrew shows that it won’t converge, along with a screenshot from the assignment.
In the lecture video, as you pointed out, Andrew says, “stochastic gradient descent won’t ever converge, it’ll always just kind of oscillate and wander around the region of the minimum”, and he continues right after that with, “But it won’t ever just head to the minimum and stay there”.
I guess what the assignment means by “reach convergence” is that SGD may not necessarily give you THE minimum value, but it will get you to a good minimum value.
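To make that concrete, here is a minimal NumPy sketch (not the assignment’s code; the toy data, learning rate, and variable names are all my own assumptions) of plain SGD on a 1-D least-squares problem. Because each update uses a single noisy example, the parameter keeps hovering around the minimizer instead of settling exactly on it:

```python
import numpy as np

# Toy 1-D linear regression: y = 3*x + noise, so the best w is about 3.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + 0.1 * rng.normal(size=1000)

w = 0.0    # parameter we optimize
lr = 0.05  # fixed learning rate (assumed value)

for epoch in range(5):
    for i in rng.permutation(len(x)):
        grad = 2 * (w * x[i] - y[i]) * x[i]  # gradient on a single example
        w -= lr * grad
    # w wanders around 3 but never settles exactly on it
    print(f"epoch {epoch}: w = {w:.4f}")
```

If you print the per-step values instead of the per-epoch ones, you can see the “oscillate and wander” behaviour from the lecture even more clearly.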
Somewhere in C2, you’ll go through an assignment where you visualise the results produced by each of these techniques (it could be the lab in question, I don’t quite remember).
Based on the task at hand and the results you get from each technique, you can try different ones and select whichever you are satisfied with.
I think Andrew also talks about using SGD together with other mechanisms (such as learning-rate decay) to get even better results.
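For example, here is a rough sketch of the same toy problem with learning-rate decay added on top of SGD. The 1/(1 + decay_rate * epoch) schedule is the one described in the learning-rate-decay video, but the specific numbers and names here are assumptions, not the assignment’s code. As the learning rate shrinks, the oscillation shrinks with it, so the iterates end up much closer to the minimum:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + 0.1 * rng.normal(size=1000)

w = 0.0
lr0 = 0.05   # initial learning rate (assumed value)
decay = 0.3  # decay rate (assumed value)

for epoch in range(10):
    lr = lr0 / (1 + decay * epoch)  # learning-rate decay schedule
    for i in rng.permutation(len(x)):
        grad = 2 * (w * x[i] - y[i]) * x[i]
        w -= lr * grad
    # the wandering around w = 3 gets tighter as lr decays
    print(f"epoch {epoch}: lr = {lr:.4f}, w = {w:.4f}")
```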