Hi, I was wondering: what does `optimizer.apply_gradients(zip(grads, trainable_variables))` do? And what is its relationship with `grads = tape.gradient(minibatch_total_loss, trainable_variables)`?

Thanks!

Hi @Marcia_Ma,

I will share a brief introduction but leave the rest of the exploration of this topic to you. You will need to experiment with it yourself to fully understand what's going on.

Recall that in vanilla gradient descent, we have this weight update formula:

w := w - \alpha \, dw

whereas in RMSProp, we have the following instead:

s := \beta s + (1 - \beta)(dw)^2, \quad w := w - \alpha \frac{dw}{\sqrt{s} + \epsilon}
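To make the two update rules concrete, here is a minimal pure-Python sketch for a single scalar weight. The learning rate, decay factor, and epsilon are just example values, not anything prescribed by the course:

```python
import math

def vanilla_gd_step(w, dw, lr=0.1):
    # w := w - alpha * dw
    return w - lr * dw

def rmsprop_step(w, dw, s, lr=0.1, beta=0.9, eps=1e-8):
    # s := beta * s + (1 - beta) * dw^2   (running average of squared gradients)
    s = beta * s + (1 - beta) * dw ** 2
    # w := w - alpha * dw / (sqrt(s) + eps)
    return w - lr * dw / (math.sqrt(s) + eps), s

w, dw = 1.0, 0.5
print(vanilla_gd_step(w, dw))        # 0.95

w_new, s_new = rmsprop_step(w, dw, s=0.0)
print(w_new)                          # ~0.6838
```

Both rules consume the same `dw`; they differ only in how they turn it into a step.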

Note that in both formulations, we always need to compute `dw`, i.e. \frac{\partial{J}}{\partial{w}}. `dw = tape.gradient(J, w)` does that computation for us.

After computing `dw`, we have to plug it into the weight update formula and then update the weight `w`; `optimizer.apply_gradients([(dw, w)])` is responsible for this process. Note that we keep `(dw, w)` in a pair by wrapping them into a tuple, so that the program knows `dw` is the gradient with respect to `w`.
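Putting the two calls together, here is a minimal end-to-end sketch in TensorFlow, minimizing a toy loss J = w^2 with plain SGD. The variable, loss, and learning rate are all just for illustration:

```python
import tensorflow as tf

w = tf.Variable(3.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

with tf.GradientTape() as tape:
    J = w ** 2          # toy loss; dJ/dw = 2w

# compute dw for each trainable variable
grads = tape.gradient(J, [w])

# zip pairs each gradient with its variable, then the optimizer
# applies its update rule to each (gradient, variable) pair
optimizer.apply_gradients(zip(grads, [w]))

print(w.numpy())  # 3.0 - 0.1 * 6.0 = 2.4
```

With a list of several variables, `zip(grads, trainable_variables)` produces exactly the list of `(dw, w)` tuples described above.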

I recommend reading this doc page for more on automatic differentiation, and this one for the use of `apply_gradients`. Google for more examples.

Cheers,

Raymond

Thank you so much Raymond, that is really helpful!