In what scenarios would one prefer the Leaky ReLU activation function over ReLU?

When does one prefer the Leaky ReLU activation function over the ReLU activation function?

How do the two differ from each other?

Although I understand why ReLU is the most common choice of activation function over sigmoid, I would like to know, by example or by a model/algorithm, when Leaky ReLU is preferred over ReLU.

Thank you in advance.


Hi DP,

Leaky ReLU is good for avoiding diminishing weights and gradients in the negative domain; depending on the application, the weights might oscillate in the negative region.

And I quote from Google:

“This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks”


Hello Gent,

From what I know or understand, ReLU does not output values in the negative domain!

And if there is a negative domain, the choice should be a linear activation function.

So are you saying Leaky ReLU would be more of a combination of these two activation functions?

Now I am confused by your answer :thinking: :slight_smile:


From Wikipedia:

[Figure from Wikipedia comparing the two functions, not shown here.]

So on the left side of the figure, when x is negative, Leaky ReLU gives small negative values (instead of the zero that ReLU would give).
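The difference described above can be sketched in a few lines of NumPy (a minimal illustration, not from the thread; the function names and the sample inputs are my own):

```python
import numpy as np

def relu(x):
    # ReLU: outputs zero for all negative inputs
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: a small slope alpha instead of zero for negative inputs
    return np.where(x >= 0, x, alpha * x)

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(relu(x))        # [0. 0. 0. 2.]
print(leaky_relu(x))  # [-0.03 -0.01  0.    2.  ]
```

Note that the two functions agree for all non-negative inputs; they differ only in how they treat the negative side.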


Thank you, both of you. I now understand how it differs from ReLU.

Can anyone give me an example of how to determine whether my model would benefit from the Leaky ReLU activation function, based on the dataset or model architecture?

Thank you in advance!!


I’ve never used Leaky ReLU, so I don’t have any information beyond what I find online and from previous discussions.

“Require” is too strong a word. There may be benefits; it depends on the data set (whether it has lots of features with negative values) and the complexity of the model (Leaky ReLU minimizes vanishing gradients, so training may be more efficient if you have a deep network).

The negative-region slope is another parameter you can tune (0.01 is just a common value; more precisely, that slope is an “alpha” hyperparameter, which you can adjust). So this gives you more work to do: finding the best alpha value.
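To make the alpha hyperparameter concrete, here is a small NumPy sketch (my own illustration; the candidate alpha values are arbitrary) showing how alpha controls both the negative-side output and the negative-side gradient, which is why the units never go fully dead:

```python
import numpy as np

def leaky_relu(x, alpha):
    # Forward pass: slope alpha for negative inputs, identity for positive
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(x, alpha):
    # Gradient is 1 for positive inputs and alpha for negative inputs,
    # so it is never exactly zero (no "dead" units)
    return np.where(x >= 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 1.0])
for alpha in (0.01, 0.1, 0.3):  # candidate values one might try when tuning
    print(alpha, leaky_relu(x, alpha), leaky_relu_grad(x, alpha))
```

In practice one would pick alpha via the same validation process used for other hyperparameters.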

Since Leaky ReLU units never become dead (as they still provide some useful output for negative values), you may need fewer units than if you are using ReLU.

But since you have another multiplication to implement, training will be more costly with Leaky ReLU.

Experimentation is the best course.

So basically you are saying that if the dataset contains negative values that could be important in training the model, we could go with Leaky ReLU?

And the cost part you are stating is because of the wide range of probability, right? And that would require more training, or more model architectures to experiment with?

Yes for your first question.

For the second, I don’t think probability is involved. It’s just that Leaky ReLU requires different multipliers for positive and negative values, and that’s going to cost more CPU time to compute. So training will consume more computer resources.


Now I have one more doubt. As you and Gent mentioned, we tend to use Leaky ReLU because of the negative values and the negative-region slope. Why not go for a linear activation function, which would cover a wide range of the domain with variability?

Is it because Leaky ReLU can capture non-linear relationships between the parameters while also allowing negative values?

A non-linear function is required in an NN hidden layer. Both ReLU types are non-linear.


Thank you, both of you, @gent.spah and @TMosh, for addressing my query.

I have a much better understanding of the Leaky ReLU activation function now.

