In the case of batch gradient descent, the trajectory is a smooth line, but the line is not smooth when it is not batch gradient descent.
Again, in the case of feature scaling, when the features (whose weights are w1 and w2) are not rescaled, gradient descent oscillates, which means it can’t take a direct path to the global minimum. But when they are rescaled, it can take a direct, smooth path to the minimum.
I feel the exact reason for the smooth path in both cases is that each next step is a small change, rather than a jump between far-apart points with a big gap.
If I am wrong, then correct me.
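Here is a rough NumPy sketch I made (my own made-up data and names, not from the lectures) of what I mean: with unscaled features the updates overshoot back and forth, and with z-score scaled features the path heads straight for the minimum.

```python
# Toy sketch: batch gradient descent on the same data, before and after scaling.
import numpy as np

rng = np.random.default_rng(0)
m = 200
x1 = rng.uniform(0, 1, m)        # small-scale feature
x2 = rng.uniform(0, 1000, m)     # large-scale feature
y = 3 * x1 + 0.005 * x2 + rng.normal(0, 0.1, m)

def run_gd(X, y, lr, steps=50):
    """Plain batch gradient descent on a squared-error cost; returns the (w1, w2) path."""
    w = np.zeros(X.shape[1])
    path = [w.copy()]
    for _ in range(steps):
        err = X @ w - y
        grad = X.T @ err / len(y)      # dJ/dw_j = (1/m) * sum(err * x_j)
        w = w - lr * grad
        path.append(w.copy())
    return np.array(path)

X_raw = np.column_stack([x1, x2])
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# With raw features, w2's gradient is roughly 1000x larger than w1's, so this
# learning rate overshoots back and forth along w2 while w1 barely moves;
# after z-score scaling, one learning rate suits both weights and the path is direct.
path_raw = run_gd(X_raw, y, lr=5e-6)
path_scaled = run_gd(X_scaled, y, lr=0.3)
print("raw-feature path (first 5 steps of w1, w2):\n", path_raw[:5])
print("scaled-feature path (first 5 steps of w1, w2):\n", path_scaled[:5])
```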
The differences are:
- the magnitude of the features, which are used to compute the gradients.
- the magnitude of the learning rate, which scales the gradients used to modify the weights (see the sketch after this list).
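A tiny numeric sketch of these two points (made-up numbers, not from the course): the gradient of a squared-error cost with respect to each weight is an average of error times feature, so the feature's magnitude sets the gradient's magnitude, and the learning rate then turns that gradient into the actual step.

```python
import numpy as np

x1 = np.array([0.2, 0.5, 0.9])       # small-scale feature
x2 = np.array([200., 500., 900.])    # same pattern, 1000x larger
err = np.array([1.0, -0.5, 2.0])     # some prediction errors (y_hat - y)

grad_w1 = np.mean(err * x1)          # order 1
grad_w2 = np.mean(err * x2)          # order 1000
lr = 0.01
print("step for w1:", -lr * grad_w1)  # small step
print("step for w2:", -lr * grad_w2)  # 1000x larger step with the same learning rate
```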
Hello @farhana_hossain
Interesting induction! I don’t think you are wrong, but maybe it can go a little bit further than that. For example,
- why does non-batch GD (i.e. stochastic GD / mini-batch GD) tend to oscillate more than batch GD?
- why does feature scaling help prevent oscillation?
The lectures have explained the 2nd question but perhaps not the 1st one.
Cheers,
Raymond
Hello, I am correcting my statement.
In the case of batch gradient descent, the line is smooth because every step uses the same full set of training data. However, in the case of non-batch gradient descent, the line is not smooth because each step uses a different subset of the data, so a zigzag occurs. This is illustrated here: Variants of Gradient Descent Algorithm | Types of Gradient Descent
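Here is a rough NumPy sketch (my own toy example, not the linked article’s code) of that difference: the full-batch run reuses the same data every step, while the mini-batch run sees a different random subset each step, so its path wobbles.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 1000
X = rng.normal(size=(m, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, m)

def gd_path(batch_size, steps=30, lr=0.1):
    """Gradient descent where each step uses a random sample of `batch_size` rows."""
    w = np.zeros(2)
    path = [w.copy()]
    for _ in range(steps):
        idx = rng.choice(m, size=batch_size, replace=False)   # pick this step's mini-batch
        err = X[idx] @ w - y[idx]
        w = w - lr * (X[idx].T @ err / batch_size)
        path.append(w.copy())
    return np.array(path)

full_path = gd_path(batch_size=m)    # batch GD: same data every step -> smooth path
mini_path = gd_path(batch_size=16)   # small mini-batches -> noisy, zigzag path
print("last 3 full-batch steps:\n", full_path[-3:])
print("last 3 mini-batch(16) steps:\n", mini_path[-3:])
```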
For feature scaling, the line is smooth because consecutive steps make only small changes, as I stated earlier.
Thanks
@farhana_hossain, agreed!
A small suggestion on your statements: besides arguing about whether it is the same batch or different batches, it is more informative to argue in terms of the batch size.
The reason is simple: in non-batch GD (btw, let’s call it mini-batch GD), I can make every mini-batch equal to the full batch minus one random sample. In that case, the optimization paths wouldn’t be too different.
I believe you have read arguments similar to the following: the smaller the mini-batch size, the larger the variance among the mini-batches. At a given training step, the cost surface is defined by both the cost function (which has a fixed form) and the mini-batch used in that step (which is variable), so the variance among the mini-batches produces variance among the next optimization steps. It took me some time to think this through when I first came across the idea, so please give yourself enough time to think about it too, and see where it leads you.
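If it helps, here is a toy sketch (made-up data, not from any lecture) of that variance argument: the spread of the mini-batch gradient, and hence of the next step, grows as the batch size shrinks.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 2000
X = rng.normal(size=(m, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, m)
w = np.array([0.5, 0.5])                        # some current weights

def minibatch_grad(batch_size):
    """Gradient of the squared-error cost defined by one randomly drawn mini-batch."""
    idx = rng.choice(m, size=batch_size, replace=False)
    err = X[idx] @ w - y[idx]
    return X[idx].T @ err / batch_size

for batch_size in (1, 16, 256, m):
    grads = np.array([minibatch_grad(batch_size) for _ in range(500)])
    # Standard deviation of the gradient across mini-batches; it shrinks as the
    # batch grows and is ~0 when the "mini-batch" is the full batch.
    print(f"batch size {batch_size:5d}: grad std = {grads.std(axis=0)}")
```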
Cheers,
Raymond