Hi there,
the benefits of ReLU are:
- there is a reduced risk of vanishing gradients, since the gradient of ReLU is constant (equal to 1) in the positive region; it does not saturate, in contrast to sigmoid or tanh
- stacked ReLU layers can still model non-linearity well, as discussed in this thread: Isn't Relu just a lineer regression function for z>=0 - #6 by Christian_Simonis. With a purely linear activation function this would not be possible.
- ReLU is cheap to compute, y = max(0, x), and is therefore often faster than the alternatives (see the small sketch after this list)
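To make the first and last points a bit more concrete, here is a minimal NumPy sketch (not tied to any particular framework, and the helper names are just for illustration) that shows ReLU as an elementwise max and compares its gradient with the sigmoid gradient, which shrinks towards zero for large |z|:

```python
import numpy as np

def relu(x):
    # y = max(0, x), applied elementwise
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 1 for x > 0 and 0 otherwise -> no saturation in the positive region
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    # peaks at 0.25 near x = 0 and vanishes for large |x|
    return s * (1.0 - s)

z = np.array([-5.0, -1.0, 0.5, 5.0, 50.0])
print("relu grad:   ", relu_grad(z))     # [0. 0. 1. 1. 1.]
print("sigmoid grad:", sigmoid_grad(z))  # roughly [0.0066 0.1966 0.2350 0.0066 0.0000]
```

As you can see, wherever the ReLU unit is active the gradient passed backwards stays at 1, while the sigmoid gradient gets tiny for large positive or negative inputs, which is exactly the saturation issue mentioned above.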
Please let me know if this answers your question!
Best regards and happy new year!
Christian