When calculating the derivative dW, why do you sum dZ * x across all m training examples?

If you go to “Gradient Descent on m Examples” in week 2, I don’t understand why calculating dW means summing across the different examples.

I also don’t get why the ReLU function is considered so great in week 3 compared to the sigmoid function, and why tanh is clearly superior to sigmoid.

Thank you so much for your time

Hey @hazingo,

Please check this thread out for the above query.

It would be wrong to say that ReLU is better than the sigmoid activation function in every case, since it really depends on the use case. For instance, if you want to train a neural network for binary classification, you will want the network to produce an output between 0 and 1, which can be done with sigmoid but not with ReLU as the activation function in the output layer.

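As a quick toy illustration of that point (my own made-up numbers, not code from the course), here is what the same pre-activations look like after sigmoid versus ReLU in an output layer:

```python
import numpy as np

# Toy illustration (made-up numbers): the same pre-activations z fed through
# sigmoid vs. ReLU in the output layer of a binary classifier.
z = np.array([-2.0, 0.5, 3.0])

sigmoid_out = 1.0 / (1.0 + np.exp(-z))   # always in (0, 1), so it can be read as P(y = 1 | x)
relu_out = np.maximum(0.0, z)            # either 0 or unbounded, so not usable as a probability

print(sigmoid_out)   # approx. [0.119, 0.622, 0.953]
print(relu_out)      # [0.0, 0.5, 3.0]
```
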
But in general, for the majority of use cases, you will find that ReLU performs better than sigmoid, for several reasons.

  • In the case of sigmoid, the activations get squeezed into the range (0, 1), and when we back-propagate through the network, the derivative of \sigma(z) is \sigma(z)(1 - \sigma(z)), so we are essentially re-using the sigmoid values themselves during back-propagation.
  • This means that during back-propagation the gradients get multiplied repeatedly by small numbers; \sigma(z)(1 - \sigma(z)) is never larger than 0.25, so in a deep network the gradients can become very small, which is known as the “vanishing gradients” problem (see the sketch after this list).
  • Additionally, during forward propagation, sigmoid requires computing an exponential, whereas ReLU only requires taking a maximum with zero, which is much less computationally expensive.
  • The same reasoning applies to back-propagation: computing the gradient requires an exponential for sigmoid, but only a comparison for ReLU.

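Here is a small NumPy sketch (my own illustration, not code from the course) that makes the last few points concrete: sigmoid’s derivative never exceeds 0.25, while ReLU’s derivative is exactly 1 for positive inputs, and ReLU itself is just a comparison with zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # needs an exponential

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                  # never larger than 0.25

def relu(z):
    return np.maximum(0.0, z)             # just a comparison, no exponential

def relu_grad(z):
    return (z > 0).astype(float)          # exactly 0 or 1

z = np.linspace(-5.0, 5.0, 101)
print(sigmoid_grad(z).max())   # 0.25 -> repeated multiplication shrinks gradients
print(relu_grad(z).max())      # 1.0  -> gradient passes through unchanged when z > 0
```
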
I hope you now have a sense of why ReLU generally works better than sigmoid. If any of these reasons are confusing at the moment, feel free to skip them, since all of them will be discussed in great depth in Course 2.

Now, coming to your first query, i.e., why calculating dW involves summing over the examples:

What we are actually doing is averaging across the different examples, not simple addition. We have the gradient for each example in the mini-batch, and we want to take a contribution from the gradient of each example in order to update the weights.

Now, in your opinion, if we have 5 values of a particular gradient and we want to use all of them to update the weight, how do you think we should combine the gradients so that each one contributes equally?

I hope this resolves your first query as well.

Cheers,
Elemento

That is because dW is a derivative of the cost J and J is the average of the loss L across all the samples in the batch. The derivative of the average is the average of the derivatives. Think about it for a second and that should make sense. The average includes a sum, of course, which is why dW includes a sum.
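
To make that concrete, here is a minimal NumPy sketch (toy numbers of my own, assuming the course’s convention that X has shape (n_x, m) and dZ = A - Y has shape (1, m)): the per-example loop and the vectorized formula dW = (1/m) X dZ^T give exactly the same result, because the matrix product hides the sum over the examples.

```python
import numpy as np

# Toy sketch of the logistic-regression gradient, assuming the course's shapes:
# X is (n_x, m), Y and A are (1, m), W is (n_x, 1).
np.random.seed(0)
n_x, m = 3, 5
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
W = np.random.randn(n_x, 1)
b = 0.0

A = 1.0 / (1.0 + np.exp(-(W.T @ X + b)))   # forward pass, shape (1, m)
dZ = A - Y                                  # per-example dz, shape (1, m)

# Loop version: accumulate each example's contribution, then divide by m,
# because J is the *average* of the per-example losses.
dW_loop = np.zeros((n_x, 1))
for i in range(m):
    dW_loop += X[:, i:i + 1] * dZ[0, i]     # dz_i * x_i for example i
dW_loop /= m

# Vectorized version: the sum over the m examples is hidden in the matrix product.
dW_vec = (1.0 / m) * (X @ dZ.T)

print(np.allclose(dW_loop, dW_vec))         # True
```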

Thank you so much, I get it now! Thanks again, this was really helpful