Hello @Sebastian_Rathmann,
You express things very clearly, and it is a pleasure working with you! 
While learning ML is a challenge, time management is also a challenge and is even more critical, so please stick to your plan. This place remains accessible even after you have completed the specialization, and I always look forward to seeing your new investigations in the future.
The only difference in code between Batch GD, Mini-batch GD, and Stochastic GD (SGD) is what fraction of the samples we use in one gradient descent update (aka one step). By convention, each iteration (aka epoch) always consumes all of the samples; consequently, the three methods take different numbers of steps per epoch. Below is an example with a training set of 2048 samples:
| Method | Number of samples per step | Number of steps per epoch |
| --- | --- | --- |
| Batch GD | 2048 | 2048/2048 = 1 |
| Mini-batch GD | 32 | 2048/32 = 64 |
| SGD | 1 | 2048/1 = 2048 |
Note again that “one gradient descent” means “one step”, and “one iteration” means “one epoch”.
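In case a concrete computation helps, here is a tiny Python sketch that just recomputes the last column of the table above; the 2048 samples and the batch sizes are the same illustrative numbers from the table, nothing more:

```python
# Illustration only: steps per epoch for each GD variant,
# assuming the same 2048-sample training set as in the table above.
n_samples = 2048

for name, batch_size in [("Batch GD", 2048), ("Mini-batch GD", 32), ("SGD", 1)]:
    steps_per_epoch = n_samples // batch_size  # one epoch consumes all samples
    print(f"{name:14s}: {batch_size:4d} samples/step -> {steps_per_epoch} steps/epoch")
```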
Therefore, two levels of loops are needed in general. The outer loop goes over the required number of epochs, whereas the inner loop goes over the required number of steps. Batch GD is a special case in which there is always exactly one step per epoch, so the inner loop can be dropped.
Random sampling is another topic, independent of which version of GD we use. Generally, before we start an epoch, we shuffle the training data. This makes sure each step sees a different combination of samples. However, because Batch GD always uses 100% of the samples in every step, shuffling changes nothing for it, so it can be skipped.
The above is why, for Batch GD, we have neither an inner loop over steps nor any randomization of the training data.
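To make the two-loop structure concrete, below is a minimal sketch of Mini-batch GD for linear regression with an MSE cost. It is not your assignment's code; the data, learning rate, and batch size are all made up for illustration. Setting batch_size to the full sample count turns it into Batch GD (one step per epoch, where shuffling is irrelevant), and setting it to 1 turns it into SGD:

```python
import numpy as np

# Made-up data: 2048 samples, 3 features, known "true" weights plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2048, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=2048)

w = np.zeros(3)       # weights
b = 0.0               # bias
learning_rate = 0.01
batch_size = 32       # 2048 -> Batch GD, 32 -> Mini-batch GD, 1 -> SGD
n_epochs = 5

for epoch in range(n_epochs):                    # outer loop: epochs
    order = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):   # inner loop: steps
        idx = order[start:start + batch_size]
        X_b, y_b = X[idx], y[idx]
        err = X_b @ w + b - y_b                           # prediction error on this batch
        w -= learning_rate * (X_b.T @ err) / len(idx)     # gradient of MSE w.r.t. w
        b -= learning_rate * err.mean()                   # gradient of MSE w.r.t. b

print(w, b)
```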
The above is also why, for my SGD, I used an inner loop over steps. However, I skipped the randomization, not because it is wrong, but because I wanted to show you that my SGD can reproduce sklearn's SGDRegressor result. In sklearn's SGDRegressor, I set its shuffle parameter to False. Even if I set it to True for sklearn's SGD and used randomization in my SGD, there is no guarantee that sklearn and I would shuffle the data in the same way, and if we shuffle differently, I cannot expect the same result.
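For reference, this is roughly the kind of configuration I mean when I say I turned off shuffling in sklearn's SGDRegressor. The parameter names below follow recent sklearn versions (older versions spell some of them differently, e.g. loss="squared_loss" or penalty="none"), and the exact values you need to match a hand-written SGD also depend on your own learning-rate schedule, initialization, and stopping rule:

```python
from sklearn.linear_model import SGDRegressor

# Illustrative settings only, not the one "correct" configuration.
sgd = SGDRegressor(
    loss="squared_error",      # plain MSE
    penalty=None,              # no regularization, to keep the cost surface simple
    learning_rate="constant",  # fixed step size
    eta0=0.01,                 # the step size itself
    shuffle=False,             # do NOT reshuffle between epochs, as discussed above
    max_iter=5,                # number of epochs
    tol=None,                  # don't stop early
    random_state=0,
)
```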
I would also like to give a very brief introduction to the differences in behavior between Batch GD, Mini-batch GD, and SGD. Since it is brief, it might be inaccurate in some situations, and you may not fully understand it yet; please just take my word for it for now, and come back to think about it and criticize it when the time is right. Again, it is not completely accurate, so it is open to criticism.
- SGD performs the largest number of gradient descent updates, meaning it walks more steps in each epoch, so it moves towards the goal faster (at least compared to Batch GD in most cases).
- The role of the training data is sometimes overlooked by learners. The data decides the optimal weights, the data decides the cost surface, and the data decides the next gradient descent step. Since the three versions of GD supply different amounts of samples in each step, they see different optimal weights, different cost surfaces, and different next steps.
- If we wish our model to perform as well as possible with respect to the whole training set, then Batch GD is the safest approach. However, it is slow (and there are other downsides). SGD is fast, but since it only performs GD on one sample at a time, its cost surface deviates the most from the surface Batch GD would see. And since SGD can see a very different sample at each step, its cost surface also keeps changing, which makes the steps stochastic (going back and forth, sometimes called a noisy convergence path; see the small numerical sketch after this list). Therefore, by switching from Batch GD to SGD, we trade a clean convergence path for speed. That is also why we have Mini-batch GD as something in between Batch GD and SGD; Mini-batch GD is the most commonly used approach.
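As a small numerical illustration of that last point (all of the numbers here are made up), we can compare the gradient computed from the full training set with the gradients computed from individual samples. The single-sample gradients scatter widely around the full-batch one, and that spread is exactly the noise in the SGD convergence path:

```python
import numpy as np

# Made-up 1D linear regression data.
rng = np.random.default_rng(1)
X = rng.normal(size=2048)
y = 3.0 * X + rng.normal(size=2048)

w = 0.0                                    # current weight, before taking a step
full_grad = np.mean((X * w - y) * X)       # Batch GD gradient of the MSE cost
sample_grads = (X * w - y) * X             # one gradient per sample (what SGD uses)

print("full-batch gradient:", full_grad)
print("single-sample gradients: mean", sample_grads.mean(),
      "std", sample_grads.std())           # large spread -> noisy SGD path
```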
Lastly, when you have time in the future, please come back and tune the parameters to match your SGD with sklearn's SGD. There are actually a lot of discussion points around those parameters. For example, do you expect that setting shuffle=True will make any difference to the final result, and why?
Really lastly, I don't expect any work or feedback from you, since you have your own schedule to stick to. From what I have seen so far, you have done really good work and you kept improving it. Please keep this momentum in your learning, as it will help you through a lot of ups and downs.
Good luck, keep learning, and see you later.
Cheers,
Raymond