Hi everyone! I am finding it hard to build an intuitive feel for why ReLU works so well as an activation function. The main complication I am facing is the following:
We learn in the lecture that linear activation functions are redundant: they don’t introduce any non-linearity, so the whole network ends up being no more expressive than a single linear neuron.
The problem I am having is this: isn’t ReLU just linear for all positive inputs? The only thing it does is zero out negative inputs. So I expect its behaviour to be no better than a linear activation function.

Every time ReLU gets a positive input, it behaves exactly like a linear activation function. So how does it create complex decision boundaries instead of partly erased linear boundaries? I did look up the other ReLU-related questions, but none of the answers helped me see what makes it so much better than a linear activation when ReLU itself is linear for all positive inputs.

I think of ReLU as being an “if” statement in the computation the network performs. That’s non-linear enough to get round the pointlessness of composing two linear functions.
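To make the “if” statement concrete, here is a minimal sketch. The function names `relu` and `linear` and the test values are just illustrative choices. It checks additivity, f(a + b) = f(a) + f(b), which every linear function has to satisfy.

```python
def relu(z):
    # The "if" statement: pass positive values through, zero out the rest.
    if z > 0:
        return z
    return 0.0


def linear(z):
    # Identity activation, i.e. a purely linear unit.
    return z


# One positive and one negative value, so the sum lands on the other
# side of the "if" from one of the terms.
a, b = 3.0, -5.0

print(linear(a + b), linear(a) + linear(b))  # -2.0 -2.0
print(relu(a + b), relu(a) + relu(b))        # 0.0 3.0
```

The linear unit gives the same answer either way. ReLU does not, so it fails the definition of a linear function even though each of its two pieces is a straight line.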

In the first layer of a network, each neuron does a split by a hyperplane (a line in week 3’s planar classification) in the input value space. Those hyperplanes can be very different, so the second layer sees many different overlapping zones. That provides the start of the complexity we need.
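A rough way to see those zones is to count the distinct on/off patterns of the first-layer neurons over a grid of inputs. This is only a sketch: the weights `W` and biases `b` below are hand-picked for illustration, not taken from the assignment.

```python
import numpy as np

# Hypothetical first layer: three neurons on 2-D inputs, weights chosen by hand.
W = np.array([[1.0, 0.0],    # neuron 0 is on where x1 > 0
              [0.0, 1.0],    # neuron 1 is on where x2 > 0
              [1.0, 1.0]])   # neuron 2 is on where x1 + x2 > 1
b = np.array([0.0, 0.0, -1.0])

# Sample the input plane on a grid.
axis = np.linspace(-2.0, 2.0, 41)
X = np.array([[x1, x2] for x1 in axis for x2 in axis])

# Pre-activations z = W x + b; ReLU lets a neuron through exactly where z > 0.
Z = X @ W.T + b
patterns = {tuple(row) for row in (Z > 0).astype(int).tolist()}

print(len(patterns))  # 7
```

Three lines cut the plane into 7 zones here. Inside each zone every ReLU is either fully on or fully off, so the layer acts as a different linear map in each zone, and the second layer gets to combine them.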

Great question to be asking. Have a look at the TensorFlow Playground to experiment with some examples.

Thanks for the reply. In which course’s week 3 is this planar classification of the input space discussed? I am in course 1 week 3 and haven’t come across it yet. I do like the “if” statement analogy, but I still can’t get a sense of how ReLU forms this hyperplane boundary when it is literally just a linear function for positive inputs. Can you direct me to a mathematical proof, or some easier explanation, of why it doesn’t behave like a linear function? If I were to randomly initialise all weights to positive values and provide only positive inputs, would ReLU still do any good? I guess not, because it’s then simply a linear function?

Course 1, Week 3 assignment “Planar Data Classification with one hidden layer” is what I was referring to.

ReLU is “piecewise linear”, but not linear. When ReLU is used as the activation, the boundary is the set of points where the biased linear combination (w·x + b) is zero.
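Spelled out for a single neuron, as a short worked version: the symbols $w$, $x$, $b$ and $z$ are notation assumed here for the neuron's weights, input, bias and pre-activation.

```latex
z = w^\top x + b,
\qquad
\mathrm{ReLU}(z) = \max(0, z) =
\begin{cases}
  z, & z > 0 \\
  0, & z \le 0
\end{cases}

% The two pieces meet where z = 0, i.e. on the hyperplane
\{\, x : w^\top x + b = 0 \,\}

% ReLU is not linear, because additivity fails:
\mathrm{ReLU}(1) + \mathrm{ReLU}(-1) = 1 \;\neq\; 0 = \mathrm{ReLU}(1 + (-1))
```

So each piece is linear in $x$, but the function as a whole is not, and the switch between the pieces happens exactly on a hyperplane in input space.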

It would be an interesting experiment to start with your “all positive weights, inputs” and see how many epochs it would take before some weights turned negative.
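As a starting point for that experiment, here is a minimal sketch of the untrained case. Assumptions: a hypothetical 2-3-1 network whose weights, biases and inputs are all drawn from the positive range (0.1, 1.0). It compares the output with and without ReLU, then flips the sign of one hidden unit.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(z, 0.0)


# Hypothetical tiny network: 2 inputs -> 3 hidden units -> 1 output.
# Everything is positive, to match the question (not a usual initialisation).
W1 = rng.uniform(0.1, 1.0, size=(3, 2))
b1 = rng.uniform(0.1, 1.0, size=3)
W2 = rng.uniform(0.1, 1.0, size=(1, 3))
b2 = rng.uniform(0.1, 1.0, size=1)
X = rng.uniform(0.1, 1.0, size=(5, 2))  # five positive inputs

# All pre-activations are positive, so ReLU never switches anything off.
with_relu = relu(X @ W1.T + b1) @ W2.T + b2
without_relu = (X @ W1.T + b1) @ W2.T + b2
print(np.allclose(with_relu, without_relu))  # True

# Flip the sign of one hidden unit: its pre-activation goes negative,
# ReLU zeros it out, and the two networks no longer agree.
W1_neg, b1_neg = W1.copy(), b1.copy()
W1_neg[0] *= -1
b1_neg[0] *= -1
with_relu_neg = relu(X @ W1_neg.T + b1_neg) @ W2.T + b2
without_relu_neg = (X @ W1_neg.T + b1_neg) @ W2.T + b2
print(np.allclose(with_relu_neg, without_relu_neg))  # False
```

So with everything positive the network really is just a linear model, as suspected. It only takes one pre-activation crossing zero for ReLU to start doing something a linear activation cannot.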

Thanks, @GordonRobinson, I now think I have an intuition for how it does the trick. I was obsessing over how it’s piecewise linear, rather than over how that boundary between different derivatives splits the input plane. Very cool. I am still, however, interested in a mathematical proof for personal satisfaction, similar to how in logistic regression the equation log(y/(1-y)) = w1x1 + w2x2 + … + b, with y = 0.5 plugged in for the decision boundary, gives us the equation of a hyperplane. I couldn’t work it out myself for ReLU.
I would highly appreciate it if you or anyone else could either explain this or direct me to a mathematical proof, perhaps a paper?
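Not a full proof, but here is a sketch of how the logistic-regression argument seems to carry over. It assumes a single ReLU hidden layer feeding a sigmoid output. The symbols $W$, $c$ (hidden weights and biases), $v$, $d$ (output weights and bias) and the 0/1 diagonal matrix $D$ are notation introduced here for illustration.

```latex
% Network output y, written in the same log-odds form as logistic regression:
\log\frac{y}{1-y} = v^\top \,\mathrm{ReLU}(Wx + c) + d

% Inside a zone where the on/off pattern of the hidden units is fixed,
% ReLU just keeps or drops each unit:
\mathrm{ReLU}(Wx + c) = D\,(Wx + c),
\qquad
D = \mathrm{diag}\big(\mathbf{1}[\,Wx + c > 0\,]\big)

% Substituting, the log-odds are affine in x within that zone:
\log\frac{y}{1-y} = (W^\top D\, v)^\top x + (v^\top D\, c + d)

% Setting y = 0.5 gives a piece of a hyperplane in that zone:
(W^\top D\, v)^\top x + (v^\top D\, c + d) = 0
```

Within one zone this is exactly the logistic-regression boundary. Because $D$ changes from zone to zone, the pieces have different slopes, and joined together they form a bent, piecewise-linear boundary rather than a single hyperplane.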