Hey @hazingo,

Please check this thread out for the above query.

It would be wrong to say that ReLU is better than the Sigmoid activation function in every case, since it really depends on the use-case. For instance, if you want to train a neural network for binary classification, you will want it to produce an output between 0 and 1, which can be done with **Sigmoid** but not **ReLU** (whose output is unbounded above) as the activation function in the output layer.
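As a minimal sketch of this point (the function names and the example logit are my own), Sigmoid maps any raw score into (0, 1), so it can be read as a probability, while ReLU cannot:

```python
import math

def sigmoid(z):
    # Squashes any real number into (0, 1) -- interpretable as a probability
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Unbounded above, so its output cannot be read as a probability
    return max(0.0, z)

logit = 2.5  # a hypothetical raw score from the last layer
print(sigmoid(logit))  # ~0.924, a valid probability for binary classification
print(relu(logit))     # 2.5, not confined to (0, 1)
```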

But in general, for a majority of use-cases, you will find that ReLU performs better than Sigmoid, for several reasons.

- In the case of Sigmoid, the activations get squeezed into the range (0, 1), and when we back-propagate through the network, the derivative of \sigma(z) is \sigma(z)(1 - \sigma(z)). So, essentially, we reuse the sigmoid output itself during back-propagation.
- Note that \sigma(z)(1 - \sigma(z)) is at most 0.25 (reached at z = 0). This means that during back-propagation, the gradients get multiplied by small numbers between 0 and 0.25 repeatedly, layer after layer, which makes them shrink towards zero, an issue known as “vanishing gradients”.
- Additionally, during forward propagation, Sigmoid requires computing an exponential, whereas ReLU only requires computing max(0, z), which is much less computationally expensive.
- The same applies to the backprop part. To calculate the gradients, you have to compute the exponential in the case of Sigmoid, whereas in the case of ReLU, only a comparison (is z greater than 0?) needs to be done.
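To make the gradient points above concrete, here is a small sketch (the helper names are my own) comparing the per-layer local gradients that get multiplied together during back-propagation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of sigmoid: sigma(z) * (1 - sigma(z)), at most 0.25 (at z = 0)
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: just a comparison, no exponential needed
    return 1.0 if z > 0 else 0.0

# Back-propagation effectively multiplies these local gradients layer by layer
depth = 10
sig_product = sigmoid_grad(0.0) ** depth   # 0.25**10 ~ 9.5e-7 -> vanishing
relu_product = relu_grad(1.0) ** depth     # 1.0**10 = 1.0 -> gradient survives
print(sig_product, relu_product)
```

Even in this best case for Sigmoid (z = 0 at every layer), ten layers shrink the gradient by roughly a factor of a million.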

I hope you now have a sense of why ReLU is usually preferred over Sigmoid. If any of these reasons are confusing at this point, please feel free to skip them, since all of them are discussed in great depth in Course 2.

Now, coming to your first query, i.e.,

It’s actually **averaging** across the different examples that we are doing, not **simple addition**. We have the gradient for each example in a mini-batch, and we want each example’s gradient to contribute to the weight update.

Now, in your opinion: if we have 5 values of a particular gradient and we want to use all of them to update the weight, how should we combine them so that each gradient contributes equally?
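As a concrete sketch of the averaging step (the gradient values and learning rate here are made up for illustration):

```python
import numpy as np

# Hypothetical per-example gradients of one weight over a mini-batch of 5
grads = np.array([0.2, -0.1, 0.4, 0.05, -0.15])

avg_grad = grads.mean()   # the mean gives each example an equal contribution
lr = 0.1                  # learning rate
w = 0.5                   # current value of the weight
w -= lr * avg_grad        # one gradient-descent update with the averaged gradient
print(avg_grad, w)
```

Averaging (rather than summing) also keeps the effective step size independent of the mini-batch size.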

I hope this resolves your first query as well.

Cheers,

Elemento