I have been studying deep learning and PyTorch and, like many of you, I keep seeing two patterns for placing the optimizer.zero_grad() call: one calls it immediately at the top of the training loop body, and the other calls it just before the step function.
Pattern 1 (zeroing gradients as soon as the loop body starts):

for inputs, labels in dataloader:
    optimizer.zero_grad()
    ... some code
    ... some code
    optimizer.step()
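For concreteness, here is a minimal runnable version of Pattern 1 with the elided "... some code" filled in by a standard forward/backward pass. The model, loss, and data below are invented placeholders for illustration, not the original code:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Tiny stand-in for a real DataLoader: two (inputs, labels) batches.
dataloader = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(2)]

for inputs, labels in dataloader:
    optimizer.zero_grad()            # clear gradients from the previous iteration
    outputs = model(inputs)          # forward pass
    loss = loss_fn(outputs, labels)
    loss.backward()                  # compute gradients for this batch
    optimizer.step()                 # update the parameters
```

Note that after the loop finishes, the gradients from the final iteration are still attached to the model, which is exactly the situation discussed below.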
Pattern 2 (zeroing gradients just before the update):

for inputs, labels in dataloader:
    ... some code
    ... some code
    optimizer.zero_grad()
    optimizer.step()
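Worth noting: as literally written, Pattern 2 zeroes the gradients before step() consumes them, so the optimizer has nothing to apply, which is the objection raised further down the thread. A quick sanity check with a throwaway model (the names here are illustrative, not from the post):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Snapshot the parameters before the "update".
before = [p.detach().clone() for p in model.parameters()]

inputs, labels = torch.randn(4, 3), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(inputs), labels)
loss.backward()            # gradients are now populated
optimizer.zero_grad()      # ...and immediately wiped out
optimizer.step()           # steps with zero (or absent) gradients

unchanged = all(torch.equal(a, b) for a, b in zip(model.parameters(), before))
print(unchanged)  # True: the parameters did not move at all
```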
I propose that the deeplearning.ai code move to a third pattern, which I think is safer: both patterns above leave stale gradients on the model if someone wants to use it somewhere else afterwards.
My suggested pattern is this:
Pattern 3 (zeroing gradients twice: once before the loop and once after each step update):

optimizer.zero_grad() # first call
for inputs, labels in dataloader:
    ... some code
    ... some code
    optimizer.step()
    optimizer.zero_grad() # second call
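A runnable sketch of Pattern 3 with placeholder model and data (not the poster's actual code), checking the post-training state of the gradients:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]

optimizer.zero_grad()                      # first call: clear any leftover grads
for inputs, labels in dataloader:
    loss = nn.functional.mse_loss(model(inputs), labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()                  # second call: clean up after the update

# After training, no stale gradients remain on the model.
no_stale = all(p.grad is None or not p.grad.any() for p in model.parameters())
print(no_stale)  # True
```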
Especially in a Jupyter environment, the model is not cleared automatically: if someone reuses the model, the gradients will accumulate on top of those left over from the previous run.
If you want to test this, rerun any training loop twice. You would expect the same convergence behavior, but you will find the runs differ, because the gradients from the final iteration of the first run were never zeroed.
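The accumulation being described can be reproduced without a full training loop: call backward() twice on the same toy model without zeroing in between, and the second gradient is exactly double the first. The model and data below are made-up placeholders:

```python
import torch
from torch import nn

model = nn.Linear(2, 1)
x, y = torch.randn(5, 2), torch.randn(5, 1)

def run_backward():
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

run_backward()
g1 = model.weight.grad.clone()
run_backward()          # no zero_grad() in between, as after a finished loop
g2 = model.weight.grad

doubled = torch.allclose(g2, 2 * g1)
print(doubled)  # True: the gradients accumulated instead of being replaced
```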
Adding optimizer.zero_grad() at the beginning of the training loop iteration is the standard and correct practice, not at the end.
The purpose of optimizer.zero_grad() is to clear the gradients calculated during the previous training iteration. Gradients accumulate by default in PyTorch.
If you place it at the end of the loop, the gradients from the current step would be cleared before the optimizer has a chance to use them to update the model’s parameters. The parameters would never be updated correctly.
By placing optimizer.zero_grad() at the beginning:
- Each iteration starts with a clean slate for gradients.
- The gradients calculated in the backward pass are used to update the model parameters.
- The optimizer updates the parameters based on the correct, non-accumulated gradients for the current batch of data.
I disagree. step() updates the parameters, and zeroing the gradients after step() is safe; otherwise, when you are done with the last step, the model still carries the gradients calculated at that step.
If the model is still in memory and someone wants to use it, most people assume the gradients are initialized to zero. Just rerun the training loop twice and observe the behavior: the two runs differ.
Your method does seem “cleaner”. But my take would be that leaving the gradients from the last step in place at the end of training is not really a “correctness” issue. The gradients are not used when you execute the model in inference mode, right? And if someone takes the existing trained model and wants to use it for Transfer Learning and do “fine tuning”, they will clear the gradients just as your logic does before starting their own iterations of further training. So the gradient data will simply be unused and unreferenced.
Then it comes down to perhaps a performance issue depending on the internals of how torch handles the gradient data. If it just frees all the tensors when you call zero_grad(), then you could save memory. But if it simply replaces them with zero values, then you don’t even save memory with the zero_grad() calls. Arguably you’ve just wasted some relatively small amount of CPU time with the torch.zeros calls, which essentially have null effect (no improvement in either performance or correctness).
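That speculation about the internals can be checked directly. In recent PyTorch versions (2.0 and later), optimizer.zero_grad() defaults to set_to_none=True, which drops the gradient tensors entirely (freeing their memory) rather than filling them with zeros; passing set_to_none=False restores the fill-with-zeros behavior. A small sketch with a throwaway linear model:

```python
import torch
from torch import nn

model = nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(3, 2)).sum()
loss.backward()
had_grad = model.weight.grad is not None   # True: gradients exist after backward

optimizer.zero_grad(set_to_none=True)      # the default in PyTorch >= 2.0
freed = model.weight.grad is None          # True: tensor dropped, memory freed

loss = model(torch.randn(3, 2)).sum()
loss.backward()
optimizer.zero_grad(set_to_none=False)     # keep the tensors, fill with zeros
zeroed = bool((model.weight.grad == 0).all())  # True: tensor kept, all zeros

print(had_grad, freed, zeroed)
```

So with the modern default, zero_grad() does in fact free the gradient memory rather than writing zeros.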
But the compute cost is the same when you’re actually running training, regardless of where you put the zero_grad() call, isn’t it? You call it once per iteration.
The only real difference, other than the readability of the code, is the effect that @yildirimga is advocating: that the model at the end of the training operation has zeroed gradients. And my argument above is that this is probably a trivial effect. But it’s fine to do it that way.
Well, I guess you end up calling zero_grad() n + 1 times with method 3, whereas you’d call it n times with the other two methods, where n is the total number of iterations. But that’s a relatively trivial cost compared to training writ large, since n is typically O(10^4) or greater.
Well, now that I think ε harder, there’s a difference between the memory size used when the model is loaded in memory (in which case the gradient tensors have the same size whether they are zeros or not) and when you store the trained model on disk. In the on-disk case, you’ll be using some sort of compression algorithm, and then the zeros will compress much better than the non-zero values. So maybe it is worth it to leave the gradients zeroed at the end of the training run.
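The compression intuition is easy to check with the standard library alone; zlib here is just a stand-in for whatever compressor a storage pipeline might apply. (Note, too, that a plain state_dict() contains only parameters, not .grad tensors, so this matters only if the gradients are serialized at all.)

```python
import os
import zlib

n = 100_000
zeros = bytes(n)          # all-zero buffer, standing in for zeroed gradients
noise = os.urandom(n)     # random buffer, standing in for non-zero gradients

zeros_size = len(zlib.compress(zeros))
noise_size = len(zlib.compress(noise))

# The all-zero buffer compresses to a tiny fraction of its original size;
# the random buffer barely compresses at all.
print(zeros_size, noise_size)
```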
For the vast majority of cases, specifically for those learning PyTorch, sticking to the standard practice of placing the optimizer.zero_grad() call at the start of the loop will get the job done without any issues. It is the convention everyone expects to see, and it handles the math correctly (it is what you see in almost all tutorials and documentation), so you really do not need to overthink it.
That being said, the suggestion to place it after the update step is a clever tweak for power users. It helps keep the model clean when you are rerunning cells in a notebook and can even help with file compression when saving models. But unless you are trying to optimise for those specific edge cases, the standard pattern is perfectly fine.