Hello @Lizhang_Qin, let’s go back to the basic.
Each time, gradient descent moves a weight by this amount: \alpha \frac{\partial{J}}{\partial{w}}, so for it to be moved to the minimum within an epoch, we need (1) enough # of moves, (2) right size of learning rate \alpha and (3) right gradient size. Before I go into the (1-3), let’s look at the following slide.
The gradient is essentially the red line which depicts the slope at our current location. Noted that it’s only telling us how much J is expected to reduce by changing w, it doesn’t tell us how large the change should be in order to get to the minimum, so that’s why we need to move step-by-step and thus discussing (1-3) becomes relevant.
(1) # of moves. It is equal to the # of gradient descents. In each epoch, we split our samples into batches and conduct one gradient descent per batch, so if we have infinite # of samples, we then have infinite # of batches, thus we can do infinite # of gradient descents and make infinite # of moves. In this case we can get to the minimum in one epoch, however, we don’t have infinite # of samples, so we can’t guarantee that.
(2) learning rate. Obviously can’t be too large as you said, and also can’t be too small. It can take you forever if you set your learning rate to 0.
(3) size of gradient (and batch size). We know the gradient is essentially error averaged over samples in a batch. If we include all samples into one batch, the errors can be averged out, resulting in very small (but smooth) steps. If we have only a single sample per batch, some steps can be larger (and thus the steps are more stochastic). So choosing a batch size balances between the step size and the smoothness of the move, and in other words, in the context of our discussion, it’s sacrificing the step size for smoothness.
Above are the factors: for # of samples, we sometimes can control and sometimes we can’t. For learning rate, we never know which is the most appropriate one, we only know some choices will work and some won’t. For batch size, it’s a balance that we need to have a smooth convergence. So, we can’t guarantee to converge in one epoch!
There can be other factors - e.g. feature normalization. If the features have very different scales and we don’t normalize, we will be forced to choose a learning rate to fit the feature of the largest scale and although that learning rate is appropriate for that feature, it could be too small for the others and therefore it will take more steps for the others to also be converged. Illustration and explanation of this can be found here.
Cheers!
Raymond
