Why is ReLU any better than Linear

Jaskeerat · April 22, 2021, 9:34am

Hi everyone! I am finding it hard to give myself an intuitive feel for why ReLU is so good as an activation function. The main complication I am facing is the following:
We learn in the lecture that linear activation functions are redundant: they don’t introduce non-linearities, so the whole network ends up being equivalent to a single neuron.
The problem I am having is that ReLU is nothing but linear for all positive inputs. The only thing it does is zero out negative inputs. So I expect its behaviour to be no better than a linear activation function.

Every time ReLU gets a positive input, it behaves exactly like a linear activation function. So how does it create complex decision boundaries instead of part-erased linear boundaries? I did look up all the other ReLU-related questions, but none of the answers helped me with what makes it so much better than a linear one when ReLU itself is linear for all positive inputs.

GordonRobinson · April 22, 2021, 1:01pm

I think of ReLU as being an “if” statement in the computation the network performs. That’s non-linear-enough to get round the pointlessness of composing two linear functions.
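A minimal sketch of that point in NumPy, assuming a tiny two-layer network with hand-picked weights (`W1`, `b1`, `W2`, `b2` and the test input `x` are made up for illustration, not taken from the course). It checks that two linear layers collapse into one, and that putting the ReLU "if" between them breaks the property every affine map has, namely f(x) + f(-x) = 2 f(0).

```python
import numpy as np

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output.
W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])

def relu(z):
    # The "if": pass z through when it is positive, otherwise output 0.
    return np.where(z > 0, z, 0.0)

def linear_net(x):
    # Linear (identity) activation in the hidden layer.
    return W2 @ (W1 @ x + b1) + b2

def relu_net(x):
    # Same weights, ReLU activation in the hidden layer.
    return W2 @ relu(W1 @ x + b1) + b2

x = np.array([2.0, 1.0])

# With a linear activation the two layers are exactly one layer (W, b).
W, b = W2 @ W1, W2 @ b1 + b2
collapses = np.allclose(linear_net(x), W @ x + b)

def is_affine_at(f, x):
    # Every affine map satisfies f(x) + f(-x) == 2 * f(0).
    return np.allclose(f(x) + f(-x), 2 * f(np.zeros_like(x)))

print(collapses)
print(is_affine_at(linear_net, x))
print(is_affine_at(relu_net, x))
```

At `x` both hidden units are switched on, at `-x` both are switched off, so the ReLU network is effectively using different weights at the two points. That is what no single linear layer can do.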

In the first layer of a network, each neuron does a split by a hyperplane (line in week 3’s planar classification) in the input value space. Those hyperplanes can be very different, and there are many different overlapping zones in the second layer. That provides the start of the complexity we need.
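To make the "overlapping zones" concrete, here is a rough sketch assuming three first-layer neurons on a 2-D input. The weights `W` and biases `b` are hypothetical, picked so the three lines are easy to read. Each neuron is on for one side of its line, and the combination of on/off states labels the zone an input falls in.

```python
import numpy as np

# Three hypothetical first-layer neurons; each row of W is the normal of a line.
W = np.array([[1.0, 0.0],    # line x1 = 0
              [0.0, 1.0],    # line x2 = 0
              [1.0, 1.0]])   # line x1 + x2 = 1
b = np.array([0.0, 0.0, -1.0])

def activation_pattern(x):
    # 1 where the neuron's pre-activation is positive (ReLU passes it), else 0.
    return tuple(int(v) for v in (W @ x + b > 0))

# Sample a grid of inputs and collect the distinct on/off patterns.
grid = np.linspace(-2.0, 2.0, 41)
patterns = {activation_pattern(np.array([x1, x2]))
            for x1 in grid for x2 in grid}

print(len(patterns))
print(sorted(patterns))
```

Within any one zone the set of active neurons is fixed, so the second layer sees a plain linear function of the input there. The function changes from zone to zone, which is where the complexity starts.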

Great question to be asking. Look at the TensorFlow Playground to play with some examples.

Jaskeerat · April 22, 2021, 1:37pm

Thanks for the reply. In which course’s week 3 is this planar classification of the input space discussed? I am in Course 1, Week 3 and haven’t come across it yet. I do like the “if” statement analogy, but I’m still unable to get a sense of how ReLU forms this hyperplane boundary when it is literally just a linear function for positive inputs. Can you direct me to a mathematical proof, or some easier explanation, of why it doesn’t behave as a linear function? If I were to randomize and provide all positive weights and positive inputs, would ReLU still do any good? I guess not, because it’s now simply a linear function?

neurogeek · April 22, 2021, 1:44pm

Take a look at this good forum discussion with some interesting reading sources here:

Take a look!

GordonRobinson · April 22, 2021, 4:39pm

Course 1, Week 3 assignment “Planar Data Classification with one hidden layer” is what I was referring to.

ReLU is “piecewise linear”, but not linear. When ReLU is used as the activation, each neuron’s boundary is the set of inputs where its biased linear combination, w·x + b, is zero.
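One way to write this down, sketched for a network with a single hidden layer of n ReLU units. The symbols w_j, b_j (hidden weights and biases), v_j, c (output weights and bias) and S (the set of units that are switched on) are notation assumed here, not the course's.

```latex
\begin{aligned}
\mathrm{ReLU}(z) &= \max(0, z), \qquad
\mathrm{ReLU}(1) + \mathrm{ReLU}(-1) = 1 \;\neq\; 0 = \mathrm{ReLU}\bigl(1 + (-1)\bigr) \\[4pt]
f(x) &= \sum_{j=1}^{n} v_j \,\mathrm{ReLU}\bigl(w_j^{\top} x + b_j\bigr) + c \\[4pt]
f(x) &= \Bigl(\sum_{j \in S} v_j w_j\Bigr)^{\top} x + \sum_{j \in S} v_j b_j + c
\quad \text{on the region where } w_j^{\top} x + b_j > 0 \iff j \in S
\end{aligned}
```

The first line shows ReLU is not additive, so it is not linear. The last line shows that inside each region cut out by the hyperplanes w_j·x + b_j = 0 the network is an ordinary affine function, so the set f(x) = 0 is a flat piece there. The pieces from neighbouring regions meet at the hyperplanes, which gives a bent decision boundary rather than a single hyperplane.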

It would be an interesting experiment to start with your “all positive weights, inputs” and see how many epochs it would take before some weights turned negative.
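Before training anything, the forward-pass half of the question can be checked directly. This is only a sketch under stated assumptions: weights, inputs and also the biases are all drawn strictly positive (the biases are an extra assumption, the thread only mentions weights and inputs), and the layer sizes are arbitrary. It does not measure how many epochs it takes for weights to turn negative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: strictly positive weights, biases and inputs.
W1 = rng.uniform(0.1, 1.0, size=(4, 3))
b1 = rng.uniform(0.1, 1.0, size=4)
W2 = rng.uniform(0.1, 1.0, size=(1, 4))
b2 = rng.uniform(0.1, 1.0, size=1)
X = rng.uniform(0.1, 1.0, size=(100, 3))   # 100 positive inputs

def relu(z):
    return np.maximum(z, 0.0)

hidden_pre = X @ W1.T + b1                 # sums of positive terms only
relu_out = relu(hidden_pre) @ W2.T + b2
linear_out = hidden_pre @ W2.T + b2        # same network with no activation

never_clipped = bool((hidden_pre > 0).all())
same_as_linear = bool(np.allclose(relu_out, linear_out))
print(never_clipped, same_as_linear)
```

In this setup no pre-activation ever reaches zero, so ReLU never clips and the network is exactly the linear one. The difference can only appear once training moves some weight or bias far enough that a pre-activation goes negative for part of the data.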

Jaskeerat · April 23, 2021, 10:14am

Thanks, @GordonRobinson, I now think I have an intuition for how it does the trick. I was obsessing over how it’s piecewise linear, but not over how that boundary between different derivatives splits the input plane. Very cool. I am still, however, interested in a mathematical proof for personal satisfaction, similar to how in logistic regression, taking log(y/(1-y)) = w1x1 + w2x2 + … + b and plugging in y = 0.5 for the decision boundary gives us the equation of a hyperplane. I couldn’t work it out myself for ReLU.
I would highly appreciate it if you or anyone else could either explain it or direct me to a mathematical proof of the same, perhaps a paper?

Thanks a lot!

linhan · May 2, 2021, 12:25pm

I think this Medium post also explains it well. I guess, in a nutshell, with many short straight lines you can approximate any continuous function.
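A small sketch of the "many short straight lines" idea, assuming sin on [0, π] as the target and evenly spaced knots (both are choices made here for illustration, and `relu_interpolant` is a hypothetical helper, not something from the linked post). Each hidden unit is one ReLU kinked at a knot, and its output weight is the change in slope at that knot.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_interpolant(f, knots):
    """Build a one-hidden-layer ReLU function that matches f at the knots."""
    y = f(knots)
    slopes = np.diff(y) / np.diff(knots)       # slope of each straight piece
    # Output weights: the first slope, then the slope change at each interior knot.
    coeffs = np.concatenate(([slopes[0]], np.diff(slopes)))

    def g(x):
        x = np.asarray(x, dtype=float)
        # One hidden unit per piece: relu(x - knot_i), kinked at knot_i.
        hidden = relu(x[..., None] - knots[:-1])
        return y[0] + hidden @ coeffs

    return g

xs = np.linspace(0.0, np.pi, 1001)
errors = []
for n_pieces in (4, 16, 64):
    knots = np.linspace(0.0, np.pi, n_pieces + 1)
    g = relu_interpolant(np.sin, knots)
    errors.append(float(np.max(np.abs(g(xs) - np.sin(xs)))))
    print(n_pieces, errors[-1])
```

The printed maximum error shrinks as the number of pieces grows. A trained network is not required to place its kinks this neatly, so this only shows what ReLU units are capable of representing.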

Jaskeerat · May 2, 2021, 5:17pm

Okay this was EXACTLY what I needed. Thank you SO much!!
