ReLU activation function

Hi, I’m wondering what makes the ReLU function perform so well and makes it the default choice for most applications. For example, for negative values the derivative is 0 - isn’t that normally an issue, the same kind of thing that makes other activation functions (e.g. sigmoid) a poor choice?
Thanks!

Hi Matim,
Two advantages of the ReLU function are its simplicity and the extremely low computational power required to calculate it (forward propagation) and its derivative (back propagation).
Another advantage over the sigmoid() or tanh() functions is that ReLUs are much less affected by the vanishing gradients problem (don’t worry about it just now, it will be explained in Course 2).
They are not a silver bullet, though. They do have some problems: they are not 0-centered, and they return 0 for all negative values (in some cases this can produce a “dead neuron”, that is, a neuron stuck at 0 that stops learning).
Nevertheless, in practice, thanks to these advantages they can produce results much faster than other activations.
For example, in this paper about an ImageNet network (don’t worry about the details, you’ll see them in the Convolutional Networks course), they mention that a four-layer CNN using ReLU trained about 6x faster than the same network using tanh().
That is a great speed improvement, and not just that: it also enables the use of even bigger (deeper) neural networks to discover solutions to more complex problems.
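To make the “cheap to compute” point concrete, here is a tiny NumPy sketch of my own (not from the course or the paper) comparing ReLU and sigmoid, forward and backward:

```python
import numpy as np

def relu(z):
    # Forward pass: just an element-wise comparison and selection.
    return np.maximum(0, z)

def relu_grad(z):
    # Backward pass: gradient is 1 where z > 0, 0 elsewhere.
    return (z > 0).astype(z.dtype)

def sigmoid(z):
    # Forward pass: needs an exponential per element.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Backward pass: s * (1 - s), which shrinks towards 0 for large |z|
    # (this saturation is what feeds the vanishing gradient problem).
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu_grad(z))     # [0. 0. 1. 1.]  -> constant gradient on the positive side
print(sigmoid_grad(z))  # ~[0.045 0.235 0.235 0.045] -> small away from 0
```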

7 Likes

Hi @javier and @Matim, here is a simple and interesting survey on these (and other) activation functions and their relation to initialization schemes. In my opinion it helps give the general picture.

2 Likes

Thank you @javier and @crisrise for your responses. I asked this question with an application of NNs to a regression-like problem in mind, where the response variable y can be positive or negative. Having looked at the paper and the responses above, I think it is the “dead neuron” issue that made me think. Perhaps I’ll get better intuitions after completing Course 2 and getting to experiment with actual data.

I think of ReLU as being an “if” statement in the computation that the network is performing.
Use the value or discard it (if negative).
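In code, that mental model is literally an if statement (a toy scalar version; np.maximum(0, z) is the vectorized equivalent):

```python
def relu_scalar(z):
    # Keep the value, or discard it when negative - just an "if".
    if z > 0:
        return z
    return 0.0
```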

1 Like

Heuristically, ReLU offers the simplest way of introducing non-linearity into the neural network. An NN must have non-linear activations between layers; otherwise there is no point in being deep (with linear activations between layers, everything collapses into a single linear layer).
It makes sense to use sigmoid as a form of non-linearity in NNs. But if you think about it, a combination of many piecewise-linear activations from different neurons produces the same kind of non-linear behavior. I’m far from being an expert on the different types of activations, but ReLU is a good first choice, which may later be replaced to improve training and the model’s generalization to unseen data.
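To see why depth is pointless without non-linearity, here is a small NumPy sketch of my own showing two purely linear layers collapsing into one (and ReLU breaking that collapse):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))

# Two "layers" with purely linear activations...
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=(5, 1))
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=(3, 1))
two_linear_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly equivalent to one linear layer with W = W2 @ W1, b = W2 @ b1 + b2.
one_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_linear_layers, one_linear_layer))  # True

# A ReLU in between breaks that equivalence, so depth actually adds expressiveness.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(np.allclose(with_relu, one_linear_layer))  # False (in general)
```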

One additional thing: a beautiful property of neural networks is that even if you have some “dead neurons”, the rest of the network can compensate for them.
In fact, in Course 2 you’ll learn a technique called Dropout, used to reduce the variance of a NN (the gap between the error on the training set and the error on the dev/test set), which randomly “kills” some neurons (although only temporarily).
So don’t worry too much about dead neurons just yet, there are many more concepts to learn first :wink:
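If you’re curious, here is a rough sketch of the “inverted dropout” idea (my own illustration; the function name and keep_prob value are arbitrary, and the course explains the real details):

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, rng=None):
    # Inverted dropout (training time only): randomly zero out a fraction
    # (1 - keep_prob) of the activations, then rescale the survivors so the
    # expected value of the layer's output is unchanged.
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(a.shape) < keep_prob).astype(a.dtype)
    return a * mask / keep_prob

a = np.ones((3, 4))
print(dropout_forward(a))  # roughly 20% of entries become 0, survivors become 1.25
```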

1 Like

I will also suggest a paper that you can have a look at; it includes a lot of information about the best-known activation functions.

Activation Functions: Comparison of trends in Practice and Research for Deep Learning

2 Likes

I actually ran into this problem once. I was working on a simple image classification task and chose ReLU as the activation. The accuracy was stuck around 60%. I read about this “dead neuron” issue, switched to the leaky ReLU activation, and it totally solved the problem. So I guess you needn’t worry too much about dead neurons. There are many neurons in the network anyway. :joy:
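For reference, that swap is essentially a one-line change; a minimal sketch of the two activations (alpha=0.01 is a common but arbitrary choice):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    # Same as ReLU on the positive side, but a small slope (alpha) on the
    # negative side, so the gradient never becomes exactly 0 and a neuron
    # can recover instead of staying "dead".
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.1, 0.0, 1.5])
print(relu(z))        # [ 0.     0.     0.     1.5  ]
print(leaky_relu(z))  # [-0.02  -0.001  0.     1.5  ]
```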