ReLU activation function

Hi, I’m wondering what makes the ReLU function perform so well and makes it the default choice for most applications. For example, for negative values the derivative is 0 - isn’t that normally an issue, the same kind of thing that makes other activation functions (e.g. sigmoid) a poor choice?
Thanks!

Hi Matim,
Two advantages of the ReLU function are its simplicity and the extremely low computational power required to calculate it (forward propagation) and its derivative (back propagation).
Another advantage over the sigmoid() or tanh() functions is that ReLUs are much less affected by the vanishing gradients problem (don’t worry about it just now, it will be explained in Course 2).
They are not a silver bullet, though. They do have some problems: they are not 0-centered, and they return 0 for all negative values (in some cases this can produce a “dead neuron”, that is, a neuron stuck at 0 that stops learning).
Nevertheless, in practice, thanks to these advantages they can produce results much faster than other activations.
For example, in this paper about an ImageNet network (don’t worry about the details, you’ll see them in the Convolutional Networks course), they mention that a four-layer CNN using ReLU trained about 6x faster than the same network using tanh().
That is a great speed improvement, and not just that: it also enables the use of even bigger (deeper) neural networks to discover solutions to more complex problems.
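To make the “cheap to compute” point concrete, here is a tiny NumPy sketch of my own (not from the course or the paper) comparing ReLU and sigmoid, forward and backward:

```python
import numpy as np

def relu(z):
    # Forward pass: just an element-wise comparison and selection.
    return np.maximum(0, z)

def relu_grad(z):
    # Backward pass: gradient is 1 where z > 0, 0 elsewhere.
    return (z > 0).astype(z.dtype)

def sigmoid(z):
    # Forward pass: needs an exponential per element.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Backward pass: s * (1 - s), which shrinks towards 0 for large |z|
    # (this saturation is what feeds the vanishing gradient problem).
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu_grad(z))     # [0. 0. 1. 1.]  -> constant gradient on the positive side
print(sigmoid_grad(z))  # ~[0.045 0.235 0.235 0.045] -> small away from 0
```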

7 Likes

Hi @javier and @Matim, here is a simple and interesting survey on these (and other) activation functions and their relation to initialization schemes. In my opinion it helps give the general picture.

2 Likes

Thank you @javier and @crisrise for your responses. I asked this question with an application of NNs to a regression-like problem in mind, where the response variable y can be positive or negative. Having looked at the paper and the responses above, I think it is the “dead neuron” issue that made me think. Perhaps I’ll get better intuitions after completing Course 2 and getting to experiment with actual data.

I think of ReLU as being an “if” statement in the computation that the network is performing.
Use the value or discard it (if negative).
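In code, that mental model is literally an if statement (a toy scalar version; np.maximum(0, z) is the vectorized equivalent):

```python
def relu_scalar(z):
    # Keep the value, or discard it when negative - just an "if".
    if z > 0:
        return z
    return 0.0
```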

1 Like

Heuristically, ReLU offers the simplest way of introducing non-linearity into the neural network. An NN must have non-linear activations between layers; otherwise there is no point in being deep (with linear activations between layers, everything collapses into a single linear layer).
It makes sense to use sigmoid as a form of non-linearity in NNs. But if you think about it, a combination of many piecewise-linear activations from different neurons produces the same kind of non-linear behavior. I’m far from being an expert on the different types of activations, but ReLU is a good first choice, which may later be replaced to improve training and the model’s generalization to unseen data.
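To see why depth is pointless without non-linearity, here is a small NumPy sketch of my own showing two purely linear layers collapsing into one (and ReLU breaking that collapse):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))

# Two "layers" with purely linear activations...
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=(5, 1))
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=(3, 1))
two_linear_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly equivalent to one linear layer with W = W2 @ W1, b = W2 @ b1 + b2.
one_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_linear_layers, one_linear_layer))  # True

# A ReLU in between breaks that equivalence, so depth actually adds expressiveness.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(np.allclose(with_relu, one_linear_layer))  # False (in general)
```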

One additional thing: a beautiful property of neural networks is that even if you have some “dead neurons”, the rest of the network can compensate for them.
In fact, in Course 2 you’ll learn a technique called Dropout, used to reduce the variance of a NN (the gap between the error on the training set and the error on the dev/test set), which randomly “kills” some neurons (although only temporarily).
So don’t worry too much about dead neurons just yet, there are many more concepts to learn first :wink:
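If you’re curious, here is a rough sketch of the “inverted dropout” idea (my own illustration; the function name and keep_prob value are arbitrary, and the course explains the real details):

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, rng=None):
    # Inverted dropout (training time only): randomly zero out a fraction
    # (1 - keep_prob) of the activations, then rescale the survivors so the
    # expected value of the layer's output is unchanged.
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(a.shape) < keep_prob).astype(a.dtype)
    return a * mask / keep_prob

a = np.ones((3, 4))
print(dropout_forward(a))  # roughly 20% of entries become 0, survivors become 1.25
```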

1 Like

I will also suggest a paper that you can have a look at; it includes a lot of information about the best-known activation functions.

Activation Functions: Comparison of trends in Practice and Research for Deep Learning

2 Likes

I actually ran into this problem once. I was working on a simple image classification task and chose ReLU as the activation. The accuracy was stuck around 60%. I read about this “dead neuron” issue, switched to the leaky ReLU activation, and it totally solved the problem. So I guess you needn’t worry too much about dead neurons. There are many neurons in the network anyway. :joy:
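For reference, that swap is essentially a one-line change; a minimal sketch of the two activations (alpha=0.01 is a common but arbitrary choice):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    # Same as ReLU on the positive side, but a small slope (alpha) on the
    # negative side, so the gradient never becomes exactly 0 and a neuron
    # can recover instead of staying "dead".
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.1, 0.0, 1.5])
print(relu(z))        # [ 0.     0.     0.     1.5  ]
print(leaky_relu(z))  # [-0.02  -0.001  0.     1.5  ]
```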