As far as we know, SGD is stochastic in nature, i.e. it picks a “random” instance of the training data at each step and then computes the gradient, which makes each step much faster because there is much less data to manipulate at one time.
Let’s consider a scenario where we have a total of 100,000 data points, and we are using SGD as our optimizer with an epoch size of 10,000. In this case, the model will randomly choose 10,000 data points (with replacement) from the training data for each epoch. Consequently, a significant portion of the data will be ignored.
The question arises whether this random sampling will introduce a high bias in the output, considering that a substantial amount of data is being disregarded.
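To put a rough number on the scenario as described, here is a minimal sketch that assumes each of the 10,000 draws is uniform and independent (sampling with replacement). The function name `expected_unseen_fraction` is made up for illustration and is not from any library.

```python
def expected_unseen_fraction(n_points: int, n_draws: int) -> float:
    """Expected fraction of points never drawn in n_draws uniform draws with replacement.

    Assumption: every draw picks each of the n_points with equal probability,
    so a given point is missed by one draw with probability 1 - 1/n_points.
    """
    return (1.0 - 1.0 / n_points) ** n_draws


frac = expected_unseen_fraction(100_000, 10_000)
print(f"{frac:.3f}")  # about 0.905, i.e. roughly 90% of the points untouched
```

Under that assumption, around nine out of ten data points would not be seen in a given pass, which is what motivates the concern about bias.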
I think you are misinterpreting the meaning of SGD. It is just minibatch gradient descent with a batch size of 1, so you don’t randomly miss any elements. In your example, if there are 10^5 samples, then one “epoch” is a loop of SGD that hits every one of those samples, one at a time. You may choose to randomly shuffle the points before each epoch so that you process them in a different order, but you still hit every one of them in each “epoch” of training. The hope is that, because the parameters are updated after every sample, convergence is more rapid and you reach a good solution in fewer total “epochs” of training. But note that this is not guaranteed: the behavior is also more “stochastic”, so convergence will not necessarily be faster.
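As a rough illustration of that description, here is a minimal sketch of SGD with a batch size of 1 using NumPy. The linear model, the squared-error loss, the learning rate and the synthetic data are all assumptions chosen for the example, and `sgd_epoch` is a hypothetical helper, not code from the course.

```python
import numpy as np


def sgd_epoch(X, y, w, lr, rng):
    """One epoch of batch-size-1 SGD for a linear model y ~ X @ w (squared error)."""
    visited = np.zeros(len(X), dtype=int)
    # Shuffle the order for this epoch; every index appears exactly once.
    for i in rng.permutation(len(X)):
        err = X[i] @ w - y[i]
        w = w - lr * err * X[i]  # gradient of 0.5 * err**2 with respect to w
        visited[i] += 1
    return w, visited


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # synthetic inputs (assumption)
true_w = np.array([1.0, -2.0, 0.5])     # weights used to generate the targets
y = X @ true_w                          # noiseless targets for simplicity

w = np.zeros(3)
for epoch in range(5):
    w, visited = sgd_epoch(X, y, w, lr=0.01, rng=rng)

print(visited.min(), visited.max())  # each sample is visited once per epoch
```

The `visited` counter is only there to make the point explicit: shuffling changes the order in which samples are processed, not which samples are processed.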
At least that is the definition of SGD that Prof Ng is teaching here. If you have found other definitions, then please give us a reference.
As Prof Ng emphasizes at many points in these courses, there is no magic “one size fits all” solution that works best in all cases.