Gradient Descent in PyTorch vs. TF

Hey, can anyone explain the reasoning here?

In TF, we compute the gradients and then pass them (the derivatives) to the optimizer for the update step. That seems perfectly logical to me; at least it closely mirrors what we would do without an optimizer. Here is the classic code:

  grads = tape.gradient(loss, model.trainable_weights)
  optimizer.apply_gradients(zip(grads, model.trainable_weights))

However, the classic PyTorch code looks like this:

  yhat = model(x)
  loss = loss_object(yhat, y)
  loss.backward()
  optimizer.step()

There isn’t anything like a grads variable.

I know that both of them work in their own worlds, but what does PyTorch do behind the scenes?

Thank you

PyTorch does the same thing when performing gradient descent.

When loss.backward() is called, the loss is differentiated with respect to all the trainable weights of the network, and each weight’s .grad attribute accumulates its gradient.
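You can see those per-parameter gradients directly. A minimal sketch (the tiny model, shapes, and names here are illustrative, not from the thread):

```python
import torch
import torch.nn as nn

# a tiny model, just to have some trainable weights
model = nn.Linear(3, 1)
loss_object = nn.MSELoss()

x = torch.randn(4, 3)
y = torch.randn(4, 1)

yhat = model(x)
loss = loss_object(yhat, y)

# before backward(), no gradients exist yet
assert all(p.grad is None for p in model.parameters())

loss.backward()

# after backward(), every trainable parameter holds its gradient in .grad --
# this list is the PyTorch counterpart of TF's grads
grads = [p.grad for p in model.parameters()]
assert all(g is not None for g in grads)
print([tuple(g.shape) for g in grads])
```

So the gradients are still there; they just live on the parameters themselves instead of in a separate list.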

optimizer.step() then updates the trainable weights of the network using those accumulated gradients.
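Put together, one PyTorch training step plays the role of TF’s gradient/apply_gradients pair. A minimal sketch (model, data, and learning rate are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_object = nn.MSELoss()

x = torch.randn(8, 3)
y = torch.randn(8, 1)

optimizer.zero_grad()                # clear gradients left over from a previous step
loss = loss_object(model(x), y)      # forward pass
loss.backward()                      # fills p.grad for every trainable parameter
optimizer.step()                     # for plain SGD: p := p - lr * p.grad
```

Because backward() accumulates (adds) into .grad rather than overwriting it, the zero_grad() call is what keeps each step using only the current batch’s gradients.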

Thanks! After seeing this code in some projects, I understand why now. Thx!

        # backward pass: compute gradient of the loss with respect to all the learnable parameters
        # update parameters: slope = slope - lr * slope.grad and bias = bias - lr * bias.grad
        # zero the gradients before running the backward pass