Why use `average` when vectorizing the backpropagation calculations (C1_W4, page 17)

Hi, in the C1_W4 lecture notes, page 17: I don’t understand why the average is used to calculate dW and db over the m examples. Could someone explain? Thanks!

Because dW and db are gradients (partial derivatives) of the cost J, and J is defined as the average of the loss values L across all m samples. L is a vector quantity (one loss per example), while J is a scalar.
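Here’s a small NumPy sketch of that point, using a single sigmoid unit as an example (the shapes and variable names here are just illustrative, not taken from the notes): the vectorized dW and db with the 1/m factor give exactly the average of the per-example gradients.

```python
import numpy as np

# Hypothetical setup: 3 features, m = 5 examples, one sigmoid unit.
rng = np.random.default_rng(0)
m = 5
X = rng.normal(size=(3, m))                      # one column per example
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W = rng.normal(size=(1, 3))
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A = sigmoid(W @ X + b)    # forward pass
dZ = A - Y                # dL/dZ for sigmoid + cross-entropy loss

# Vectorized gradients: note the 1/m — these are averages over examples.
dW = (dZ @ X.T) / m       # shape (1, 3)
db = np.sum(dZ) / m

# Same gradients computed one example at a time, then averaged.
dW_loop = np.mean([dZ[:, i:i+1] @ X[:, i:i+1].T for i in range(m)], axis=0)
db_loop = np.mean([dZ[0, i] for i in range(m)])

print(np.allclose(dW, dW_loop), np.isclose(db, db_loop))  # True True
```

The 1/m in the vectorized formula is doing nothing more than the `np.mean` in the loop version: it averages the m per-example gradients in one matrix multiply.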

Of course, the other thing to remember is that the derivative of an average is the average of the derivatives. Think about it for a second and that should make sense.
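You can check that fact numerically with a toy example (the quadratic loss here is just a convenient choice, not the course's loss function): differentiating the averaged cost gives the same number as averaging the per-example derivatives.

```python
import numpy as np

# Toy per-example losses L_i(w) = (w - x_i)**2 for a scalar parameter w.
x = np.array([1.0, 2.0, 4.0])
w = 0.5

J = lambda w: np.mean((w - x) ** 2)           # cost = average of the losses

dL = 2.0 * (w - x)                            # per-example derivatives dL_i/dw
dJ_from_avg = np.mean(dL)                     # average of the derivatives

# Numeric derivative of the averaged cost (central difference).
h = 1e-6
dJ_numeric = (J(w + h) - J(w - h)) / (2 * h)

print(np.isclose(dJ_from_avg, dJ_numeric))    # True
```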

And if the next question is: OK, then why are the other gradients not averages? It’s because everything in those formulas other than dW and db is just a “chain rule” factor used to compute dW and db. Those terms aren’t derivatives of J; they are derivatives of other (intermediate) quantities.


Thanks! Got it now!