Hi,
Do we have to implement GD with Momentum, Adam and RMSprop ONLY with mini-batches, where the mini-batch size != m?
Or can we also implement them with the old single-batch (full batch) approach?
The logic of all three optimizers is independent of the batch size, so this is not a question of correctness: they will run fine with full batch (mini-batch size = m). The only question is whether they do any good in that case. To the extent that you are using those more sophisticated optimizers to mitigate the higher stochasticity you get with smaller batches, you would expect less benefit with full batch, where the gradients are not noisy. But it's also possible that you still get some benefit, because the cost surfaces are so complex in any case.

The other thing to ask here is whether anyone actually does full batch GD any more. I don't really know the answer in terms of overall industry practice, but there is the famous Yann LeCun quote: "Friends don't let friends use batch sizes greater than 32". Or words to that effect …
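
To make the "independent of the batch size" point concrete, here is a minimal sketch in NumPy. It is not the course's implementation. The toy linear-regression problem, the noiseless data, the hyperparameters and the names `momentum_step` and `train` are all assumptions made up for illustration. The same momentum update runs unchanged whether `batch_size` equals m or is smaller.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One GD-with-momentum update (hypothetical helper).

    Nothing here knows how many examples produced `grad`.
    """
    v = beta * v + (1 - beta) * grad   # exponentially weighted average of gradients
    w = w - lr * v                     # step along the smoothed gradient
    return w, v

def train(X, y, batch_size, epochs=500, seed=0):
    """Fit linear regression with momentum; batch_size == m means full batch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    v = np.zeros(n)
    for _ in range(epochs):
        perm = rng.permutation(m)                      # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)     # mean squared-error gradient
            w, v = momentum_step(w, v, grad)
    return w

# Assumed toy data: 256 examples, 3 features, no label noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

w_full = train(X, y, batch_size=256)   # full batch: one update per epoch
w_mini = train(X, y, batch_size=32)    # mini-batch: eight updates per epoch
print(w_full, w_mini)
```

Only the loop that slices the data changes between the two calls. Whether momentum actually helps in the full batch case is a separate, empirical question that this toy problem does not settle.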