Understanding ReLU deeply

How does ReLU activation achieve nonlinearity? In other words, how can ReLU activation in neural networks fit non-linear data, or outperform other activation functions, when it is essentially linear with only a rectified output? Why not a 2nd-degree polynomial activation function? Even a sigmoid is non-linear and should perform better, right? Please help me understand this.

Hi @Ashwin_Yellore

welcome to the community and thanks for your first post.

You can consider the ReLU as a kind of "filter" which passes positive numbers through but sets everything else to zero. The ability of the neural net to describe and learn non-linear characteristics and cause-and-effect relationships comes from the combination of many neurons, where the non-linearity emerges from the zeroed-out negative part of the ReLU function. During training, the "best" parameters (or weights) can be learned to minimize a cost function, see also:
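To make this concrete, here is a minimal NumPy sketch (the weights and biases are hand-picked purely for illustration, not taken from the course): each ReLU unit contributes one "kink", and summing a few of them already gives a clearly non-linear, piecewise-linear output.

```python
import numpy as np

def relu(z):
    # ReLU "filter": pass positive values through, set everything else to zero
    return np.maximum(0, z)

x = np.linspace(-2, 2, 9)              # a few 1-D sample inputs, shape (9,)
W1 = np.array([[1.0], [1.0], [-1.0]])  # hidden weights: 3 units, 1 input
b1 = np.array([[0.0], [-1.0], [0.0]])  # hidden biases
W2 = np.array([[1.0, 2.0, 1.0]])       # output weights

a1 = relu(W1 @ x[np.newaxis, :] + b1)  # hidden activations, shape (3, 9)
y_hat = (W2 @ a1).ravel()              # network output, shape (9,)

# y_hat equals relu(x) + 2*relu(x - 1) + relu(-x): a bowl-shaped,
# piecewise-linear curve with kinks at x = 0 and x = 1, i.e. clearly
# non-linear, even though each unit alone is just a rectified linear function.
print(np.round(y_hat, 2))
```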

You are right that other non-linear activation functions can be used, such as the sigmoid you mentioned, see also this thread.

When it comes to polynomial activation functions, one drawback is that their gradients vary strongly with the input, which leads to the issues described in this thread: a cubic function, for example, would tend towards exploding gradients if the activation input is too large.
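To illustrate that point numerically (a rough sketch with made-up values, not a benchmark): the gradient of a cubic activation grows quadratically with the input, while the ReLU gradient stays bounded.

```python
import numpy as np

z = np.array([0.5, 2.0, 10.0, 100.0])  # some (made-up) pre-activation values

grad_cubic = 3 * z**2                   # a(z) = z^3  ->  a'(z) = 3 z^2
grad_relu = (z > 0).astype(float)       # ReLU'(z) is 0 for z < 0 and 1 for z > 0

print(grad_cubic)  # 0.75, 12.0, 300.0, 30000.0 -> blows up for large z
print(grad_relu)   # 1.0, 1.0, 1.0, 1.0         -> stays bounded
```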

Regarding your question on 2nd-degree functions: besides being prone to exploding gradients (e.g. if the activation input is too large), a function like x^2 is also non-monotonic over the real numbers. A good activation function should be monotonic to support effective training with back-prop and a consistent "pull" towards the actual optimum (i.e. if the input increases, the output should not decrease), see also this source.
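A tiny sketch of that non-monotonicity (illustrative values only):

```python
import numpy as np

z = np.array([-3.0, -1.0, 1.0, 3.0])

a = z**2       # [9. 1. 1. 9.]   : increasing the input can decrease the output
da_dz = 2 * z  # [-6. -2. 2. 6.] : the gradient flips sign at z = 0

# ReLU, in contrast, is monotonic: its output never decreases as the input grows.
a_relu = np.maximum(0, z)  # [0. 0. 1. 3.]
```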

Best regards
Christian

Please let me know if this answers your question and if the provided sources were helpful, @Ashwin_Yellore.

All the best!

Regards
Christian

Hi @Christian_Simonis , Thank you very much for the quick and elaborate response. It is very helpful and also gives me a good intuition on how the activation functions are selected.


Maybe the one other point that Christian hinted at would be worth stating more explicitly: the other thing that is really useful about ReLU is that it is incredibly cheap to compute, compared to functions like sigmoid and tanh that involve exponentials. Evaluating transcendental functions is quite expensive computationally, because you’re effectively doing a Taylor Series expansion: there is no exact way to compute things like e^z. So you can think of ReLU as the “minimalist” activation function: it’s way cheap to compute and provides just the bare minimum of non-linearity. It doesn’t always work, because it also has a version of what Prof Ng will later call the “dead neuron” problem (stay tuned for Course 2 for more on that), but it is common practice to try ReLU first for your hidden layer activations. Only in cases where it doesn’t work (doesn’t give good convergence) do you graduate to more expensive and sophisticated activation functions like sigmoid, tanh and swish.
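To put a rough number on the cost difference, here is a quick NumPy timing sketch. The exact figures depend on your hardware and library versions, so treat it as illustrative only, but ReLU typically comes out several times faster because it avoids the exponential.

```python
import numpy as np
import timeit

z = np.random.randn(1_000_000)

relu = lambda v: np.maximum(0, v)         # one comparison per element
sigmoid = lambda v: 1 / (1 + np.exp(-v))  # requires evaluating exp()

print("ReLU:   ", timeit.timeit(lambda: relu(z), number=100))
print("sigmoid:", timeit.timeit(lambda: sigmoid(z), number=100))
```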

In addition to Paul’s excellent explanation: if you are searching for more info and additional sources, here is a thread which could be interesting for you, @Ashwin_Yellore:

Happy learning and all the best
Christian

Hi @paulinpaloalto , Thank you very much for the additional information and hints, it gives me a good insight into the cost of computation. Good to know that ReLU is a cheap and effective activation function for most problems, and also about the exceptions and the dead neuron problem. It gets more interesting the more I explore.
Hi @Christian_Simonis, Thank you for the additional thread link, I will go through that thread too.