Hi, I was wondering what optimizer.apply_gradients(zip(grads, trainable_variables))
does? And what is its relationship to grads = tape.gradient(minibatch_total_loss, trainable_variables)?
Thanks!
Hi @Marcia_Ma,
I will share a brief introduction but leave the rest of the exploration of this topic to you. You will need to experiment with it yourself to fully understand what's going on there.
Recall that in our vanilla gradient descent, we have this weight update formula:

w := w - \alpha \, dw

whereas in RMSProp, we have the following instead:

s_{dw} := \beta s_{dw} + (1 - \beta)(dw)^2, \quad w := w - \alpha \frac{dw}{\sqrt{s_{dw}} + \epsilon}
Note that in both formulations, we always need to compute dw
, or \frac{\partial{J}}{\partial{w}}. dw = tape.gradient(J, w)
does that computation for us.
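Here is a minimal sketch of that step, using a toy loss J(w) = w^2 that I made up just for illustration:

```python
import tensorflow as tf

# Toy example: J(w) = w^2, so dJ/dw = 2w.
w = tf.Variable(3.0)

with tf.GradientTape() as tape:
    J = w ** 2            # the loss is recorded on the tape

dw = tape.gradient(J, w)  # computes dJ/dw = 2 * 3.0 = 6.0
print(dw.numpy())         # 6.0
```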
After computing dw
, we have to plug it into the weight update formula and then update the weight w. optimizer.apply_gradients([(dw, w), ])
is responsible for this process. Note that we keep (dw, w)
in a pair by wrapping them into a tuple so that the program knows dw
is the gradient with respect to w
.
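And here is a small sketch of applying that gradient with an optimizer (I'm assuming plain SGD here just for illustration); the commented lines at the end show how this generalizes to your zip(grads, trainable_variables) call:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

w = tf.Variable(3.0)
with tf.GradientTape() as tape:
    J = w ** 2
dw = tape.gradient(J, w)

# Each pair is (gradient, variable): dw gets applied to w.
optimizer.apply_gradients([(dw, w)])
print(w.numpy())  # 3.0 - 0.1 * 6.0 = 2.4

# With a model, trainable_variables is a list, so zip pairs each
# gradient with its matching variable:
# grads = tape.gradient(minibatch_total_loss, model.trainable_variables)
# optimizer.apply_gradients(zip(grads, model.trainable_variables))
```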
I recommend reading this doc page for more on auto differentiation, and this one for the use of apply_gradients
. Google for more examples.
Cheers,
Raymond
Thank you so much Raymond, that is really helpful!
You are welcome @Marcia_Ma!