Why do we use the ReLU activation function in the hidden layers compared to other activation functions like sigmoid, linear activation, tanh, etc.?
- There is a reduced risk of vanishing gradients, since the gradient in the positive section of the ReLU function is constant. It does not saturate, in contrast to sigmoid or tanh.
- ReLU introduces non-linearity, as discussed in this thread: Isn't Relu just a lineer regression function for z>=0 - #6 by Christian_Simonis. With a pure linear activation function this would not be possible.
- ReLU is easy to compute with y = max(0, x) and is therefore often faster than alternatives such as sigmoid or tanh (see the small sketch after this list).
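To make the first and last points more concrete, here is a minimal NumPy sketch (not from the original thread, just an illustration) comparing the gradients of ReLU and sigmoid: the ReLU gradient stays at 1 for positive inputs, while the sigmoid gradient shrinks toward 0 for large |x|.

```python
import numpy as np

def relu(x):
    # ReLU: y = max(0, x)
    return np.maximum(0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 for x < 0: no saturation on the positive side
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    # Gradient s * (1 - s) approaches 0 for large |x|, i.e. the function saturates
    return s * (1 - s)

x = np.array([-10.0, -1.0, 0.5, 1.0, 10.0])
print("ReLU grad:   ", relu_grad(x))     # [0. 0. 1. 1. 1.]
print("sigmoid grad:", sigmoid_grad(x))  # values shrink toward 0 for large |x|
```

In a deep network these per-layer gradient factors get multiplied together during backpropagation, which is why the constant gradient of ReLU in the positive region helps against vanishing gradients.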
Please let me know if this answers your question!
Best regards and happy new year!
In case you are interested in more information, also with respect to the evaluation of different activation functions, feel free to take a look at this paper: https://arxiv.org/pdf/2109.14545.pdf