I have been studying deep learning and PyTorch and, like many of you, I keep seeing two patterns for placing the optimizer.zero_grad() call: one calls it immediately at the top of the training loop body, and the other calls it just before the step function.
Pattern 1 (zeroing gradients as soon as the loop body starts):

for inputs, labels in dataloader:
    optimizer.zero_grad()
    ... some code
    ... some code
    optimizer.step()
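For concreteness, here is a minimal runnable version of Pattern 1 with the elided "... some code" filled in by a standard forward/backward pass. The model, loss, and data below are invented placeholders for illustration, not the original code:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Tiny stand-in for a real DataLoader: two (inputs, labels) batches.
dataloader = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(2)]

for inputs, labels in dataloader:
    optimizer.zero_grad()            # clear gradients from the previous iteration
    outputs = model(inputs)          # forward pass
    loss = loss_fn(outputs, labels)
    loss.backward()                  # compute gradients for this batch
    optimizer.step()                 # update the parameters
```

Note that after the loop finishes, the gradients from the final iteration are still attached to the model, which is exactly the situation discussed below.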
Pattern 2 (zeroing gradients just before the update):

for inputs, labels in dataloader:
    ... some code
    ... some code
    optimizer.zero_grad()
    optimizer.step()
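Worth noting: as literally written, Pattern 2 zeroes the gradients before step() consumes them, so the optimizer has nothing to apply, which is the objection raised further down the thread. A quick sanity check with a throwaway model (the names here are illustrative, not from the post):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Snapshot the parameters before the "update".
before = [p.detach().clone() for p in model.parameters()]

inputs, labels = torch.randn(4, 3), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(inputs), labels)
loss.backward()            # gradients are now populated
optimizer.zero_grad()      # ...and immediately wiped out
optimizer.step()           # steps with zero (or absent) gradients

unchanged = all(torch.equal(a, b) for a, b in zip(model.parameters(), before))
print(unchanged)  # True: the parameters did not move at all
```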
I propose that the deeplearning.ai code move to a third pattern, which I think is safer: both patterns above leave stale gradients on the model if someone wants to use it somewhere else afterwards.
My suggested pattern is this:
Pattern 3 (zeroing gradients twice: once before the loop and once after each step update):

optimizer.zero_grad() # first call
for inputs, labels in dataloader:
    ... some code
    ... some code
    optimizer.step()
    optimizer.zero_grad() # second call
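A runnable sketch of Pattern 3 with placeholder model and data (not the poster's actual code), checking the post-training state of the gradients:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]

optimizer.zero_grad()                      # first call: clear any leftover grads
for inputs, labels in dataloader:
    loss = nn.functional.mse_loss(model(inputs), labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()                  # second call: clean up after the update

# After training, no stale gradients remain on the model.
no_stale = all(p.grad is None or not p.grad.any() for p in model.parameters())
print(no_stale)  # True
```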
Especially in a Jupyter environment, the model is not cleared automatically: if someone reuses the model, the gradients will accumulate on top of those left over from the previous run.
If you want to test this, rerun any training loop twice. You would expect the same convergence behavior, but you will find the runs differ, because the gradients from the final iteration of the first run were never zeroed.
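The accumulation being described can be reproduced without a full training loop: call backward() twice on the same toy model without zeroing in between, and the second gradient is exactly double the first. The model and data below are made-up placeholders:

```python
import torch
from torch import nn

model = nn.Linear(2, 1)
x, y = torch.randn(5, 2), torch.randn(5, 1)

def run_backward():
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

run_backward()
g1 = model.weight.grad.clone()
run_backward()          # no zero_grad() in between, as after a finished loop
g2 = model.weight.grad

doubled = torch.allclose(g2, 2 * g1)
print(doubled)  # True: the gradients accumulated instead of being replaced
```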
Adding optimizer.zero_grad() at the beginning of the training loop iteration is the standard and correct practice, not at the end.
The purpose of optimizer.zero_grad() is to clear the gradients calculated during the previous training iteration. Gradients accumulate by default in PyTorch.
If you place it at the end of the loop, the gradients from the current step would be cleared before the optimizer has a chance to use them to update the model’s parameters. The parameters would never be updated correctly.
By placing optimizer.zero_grad() at the beginning:
- Each iteration starts with a clean slate for gradients.
- The gradients calculated in the backward pass are used to update the model parameters.
- The optimizer updates the parameters based on the correct, non-accumulated gradients for the current batch of data.
I disagree. step() updates the parameters, and zeroing the gradients after step() is safe; otherwise, when you are done with the last step, the model still carries the gradients calculated at that step.
If the model is still in memory and someone wants to use it, most people assume the gradients are initialized to zero. Just rerun the training loop twice and observe the behavior: the two runs differ.
Your method does seem “cleaner”. But my take would be that leaving the gradients from the last step in place at the end of training is not really a “correctness” issue. The gradients are not used when you execute the model in inference mode, right? And if someone takes the existing trained model and wants to use it for Transfer Learning and do “fine tuning”, they will clear the gradients just as your logic does before starting their own iterations of further training. So the gradient data will simply be unused and unreferenced.
Then it comes down to perhaps a performance issue depending on the internals of how torch handles the gradient data. If it just frees all the tensors when you call zero_grad(), then you could save memory. But if it simply replaces them with zero values, then you don’t even save memory with the zero_grad() calls. Arguably you’ve just wasted some relatively small amount of CPU time with the torch.zeros calls, which essentially have null effect (no improvement in either performance or correctness).
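That speculation about the internals can be checked directly. In recent PyTorch versions (2.0 and later), optimizer.zero_grad() defaults to set_to_none=True, which drops the gradient tensors entirely (freeing their memory) rather than filling them with zeros; passing set_to_none=False restores the fill-with-zeros behavior. A small sketch with a throwaway linear model:

```python
import torch
from torch import nn

model = nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(3, 2)).sum()
loss.backward()
had_grad = model.weight.grad is not None   # True: gradients exist after backward

optimizer.zero_grad(set_to_none=True)      # the default in PyTorch >= 2.0
freed = model.weight.grad is None          # True: tensor dropped, memory freed

loss = model(torch.randn(3, 2)).sum()
loss.backward()
optimizer.zero_grad(set_to_none=False)     # keep the tensors, fill with zeros
zeroed = bool((model.weight.grad == 0).all())  # True: tensor kept, all zeros

print(had_grad, freed, zeroed)
```

So with the modern default, zero_grad() does in fact free the gradient memory rather than writing zeros.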
But the compute cost is the same when you’re actually running training, regardless of where you put the zero_grad() call, isn’t it? You call it once per iteration.
The only real difference, other than the readability of the code, is the effect that @yildirimga is advocating: that the model at the end of the training operation has zeroed gradients. And my argument above is that this is probably a trivial effect. But it’s fine to do it that way.
Well, I guess you end up calling zero_grad() n + 1 times with method 3, whereas you’d call it n times with the other two methods, where n is the total number of iterations. But that’s a relatively trivial cost compared to training writ large, since n is typically O(10^4) or greater.
Well, now that I think ε harder, there’s a difference between the memory size used when the model is loaded in memory (in which case the gradient tensors have the same size whether they are zeros or not) and when you store the trained model on disk. In the on-disk case, you’ll be using some sort of compression algorithm, and then the zeros will compress much better than the non-zero values. So maybe it is worth it to leave the gradients zeroed at the end of the training run.
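The compression intuition is easy to check with the standard library alone; zlib here is just a stand-in for whatever compressor a storage pipeline might apply. (Note, too, that a plain state_dict() contains only parameters, not .grad tensors, so this matters only if the gradients are serialized at all.)

```python
import os
import zlib

n = 100_000
zeros = bytes(n)          # all-zero buffer, standing in for zeroed gradients
noise = os.urandom(n)     # random buffer, standing in for non-zero gradients

zeros_size = len(zlib.compress(zeros))
noise_size = len(zlib.compress(noise))

# The all-zero buffer compresses to a tiny fraction of its original size;
# the random buffer barely compresses at all.
print(zeros_size, noise_size)
```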
For the vast majority of cases, specifically for those learning PyTorch, sticking to the standard practice of placing the optimizer.zero_grad() call at the start of the loop will get the job done without any issues. It is the convention everyone expects to see, and it handles the math correctly (it is what you see in almost all tutorials and documentation), so you really do not need to overthink it.
That being said, the suggestion to place it after the update step is a clever tweak for power users. It helps keep the model clean when you are rerunning cells in a notebook and can even help with file compression when saving models. But unless you are trying to optimise for those specific edge cases, the standard pattern is perfectly fine.