ReLU function as activation function

Why do we use ReLU as an activation function instead of sigmoid? In Course 1 we used sigmoid for classification problems instead of linear functions, because a linear function does not fit binary outputs well.

A few reasons I think:

  • ReLU’s formula is simpler, so it is easier to train.
  • Experiments show that ReLU outperforms sigmoid.
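To make the "simpler formula" point concrete, here is a minimal NumPy sketch of the two activations (the function names are my own, not from the course code):

```python
import numpy as np

def relu(z):
    # ReLU: max(0, z) -- just a comparison, very cheap to compute
    return np.maximum(0.0, z)

def sigmoid(z):
    # Sigmoid: 1 / (1 + e^{-z}) -- needs an exponential
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))     # [0. 0. 3.]
print(sigmoid(z))  # values squashed into (0, 1)
```

ReLU is a single elementwise `max`, while sigmoid requires evaluating an exponential for every unit, which is part of why ReLU networks train faster.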

ReLU is only used in the hidden layers. You still need sigmoid at the output layer to get the classes.
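That split can be sketched as a tiny forward pass, with ReLU in the hidden layer and sigmoid at the output (a hypothetical 2-layer binary classifier; all shapes and weights here are made up for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))    # 4 samples, 3 features
W1 = rng.normal(size=(3, 5))   # hidden layer: 5 units
b1 = np.zeros(5)
W2 = rng.normal(size=(5, 1))   # output layer: 1 unit
b2 = np.zeros(1)

hidden = relu(X @ W1 + b1)          # ReLU in the hidden layer
probs = sigmoid(hidden @ W2 + b2)   # sigmoid squashes the output to (0, 1)
print(probs.ravel())                # each value is usable as a class probability
```

Without the sigmoid at the end, the raw output could be any real number, so you could not read it as a probability of the positive class.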

There are tradeoffs:
It’s computationally easy to compute the gradients for ReLU. But since you get no gradient for negative z values, you may need more ReLU units than you would with sigmoid.
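The gradient tradeoff can be seen directly by evaluating both derivatives (a minimal sketch; the helper names are my own):

```python
import numpy as np

def relu_grad(z):
    # dReLU/dz is 1 for z > 0 and exactly 0 for z <= 0 (the "dead" region)
    return (z > 0).astype(float)

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    # dSigmoid/dz = s * (1 - s): nonzero everywhere, but it saturates
    return s * (1.0 - s)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu_grad(z))     # [0. 0. 1. 1.] -- no learning signal for negative z
print(sigmoid_grad(z))  # small but nonzero, even at the extremes
```

A ReLU unit whose input stays negative contributes nothing to the gradient, which is why you often compensate with more units; sigmoid always passes some gradient, but it shrinks toward zero for large |z|.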

Hello @Mohamed_Hussien1,

I just hope to make sure that, after the mentors’ replies, you have got the idea that we don’t always judge an activation function by how it is used in the output layer. In contrast, ReLU is more commonly used in the hidden layers, and we like its characteristics: it is non-linear and simple.