In the example that Andrew showed in the course, a training dataset (m = 5,000,000) can be partitioned into 5,000 mini-batches, each containing 1,000 examples.
I’m curious whether we can allow sampling with replacement when constructing mini-batches. For example, could we randomly select 1,000 examples from the m training examples for each batch and repeat that process, say, 10,000 times to get 10,000 mini-batches?
Intuitively, it looks like bootstrapping.
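Just to make the idea concrete, here is a rough NumPy sketch of what I mean (not code from the assignment; the function name and the column-wise data layout are my own assumptions):

```python
import numpy as np

def minibatches_with_replacement(X, Y, batch_size=1000, num_batches=10000, seed=0):
    """Yield mini-batches drawn by sampling `batch_size` examples
    uniformly *with replacement* from the full training set."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]  # examples stored as columns, following the course convention
    for _ in range(num_batches):
        idx = rng.integers(0, m, size=batch_size)  # indices drawn with replacement
        yield X[:, idx], Y[:, idx]
```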
The normal process (which we will see in the programming exercise in W2) is that you randomly shuffle the full dataset on each “epoch” before creating the minibatches. So in the example of m = 5,000,000 and batch size = 1000, you’d have 5,000 minibatches in each epoch, but the contents of each individual minibatch will be different in each epoch. The intent is to smooth out the statistical behavior.
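In code, the idea looks roughly like this (a simplified sketch in the spirit of the assignment’s shuffling helper, not its exact implementation):

```python
import numpy as np

def random_minibatches(X, Y, batch_size=1000, seed=0):
    """Shuffle the whole dataset once for this epoch, then partition it into
    consecutive mini-batches, so every example appears exactly once per epoch."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                 # fresh shuffle for this epoch
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [
        (X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
        for k in range(0, m, batch_size)      # the last batch may be smaller
    ]
```

Calling this with a different seed (or no seed) on each epoch is what makes the contents of each individual minibatch differ from epoch to epoch.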
Note that this is not exactly what I think you are proposing. In your scheme some of the data is duplicated within each epoch, and it’s not clear whether that is a good thing or not. If you have 10,000 minibatches in an “epoch”, that effectively changes the definition of an epoch: you are processing twice as many samples per pass. Perhaps it all comes out in the wash, and with your scheme you’d end up needing half as many of these “double epochs” to reach the same level of convergence you’d get with Prof Ng’s definition. But if I’m understanding your proposal correctly, I think the total cost in terms of wall clock time and CPU/GPU time would be essentially the same in both cases, or very close to it.
@paulinpaloalto made an important observation about the difference between the standard approach and the alternative scheme you proposed. While your approach is unconventional, experimenting with it could give some insight into how different mini-batch strategies affect training dynamics. Just keep an eye on the potential drawbacks, in particular the sampling bias within each “epoch” (some examples get repeated while others are skipped) and any effect that may have on overfitting.
@paulinpaloalto Thank you for your reply, I think you are right.
So I tried this mini-batch sampling scheme by revising the code in the W2 assignment.
The cost curve does seem to be more jagged, but there is basically no difference in accuracy or computation time.
But I’m also wondering: why are the result figures in sections 6.1, 6.2, and 6.3 of W2 so smooth? I didn’t expect that. Andrew said in the course that mini-batch gradient descent makes the cost curve oscillate more than batch gradient descent does, right?
The answer is to look closely at how those graphs are plotted in the assignment. Notice that they only show the cost at the end of every 100 full epochs, right? That smooths out all the statistical noise that you see on each individual minibatch.
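Here is a toy illustration of the effect (purely synthetic numbers, just to show how recording one averaged value every 100 epochs hides the per-minibatch noise):

```python
import numpy as np

rng = np.random.default_rng(0)
num_epochs, batches_per_epoch = 10000, 5

per_batch_costs = []   # one value per mini-batch update: very jagged
plotted_costs = []     # what an assignment-style plot actually shows
for epoch in range(num_epochs):
    trend = 1.0 / (1.0 + 0.001 * epoch)                  # pretend "true" cost decay
    epoch_costs = trend + 0.05 * rng.standard_normal(batches_per_epoch)
    per_batch_costs.extend(epoch_costs)
    if epoch % 100 == 0:
        plotted_costs.append(epoch_costs.mean())          # one point per 100 epochs
```

Plotting `per_batch_costs` gives the zigzag you saw in your experiment; plotting `plotted_costs` looks much smoother, which is basically what the notebook’s figures are doing.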
I see! Thank you so much!