Understanding ReLU deeply

Hi @Ashwin_Yellore

Welcome to the community, and thanks for your first post.

You can think of the ReLU as a kind of "filter" which passes positive numbers through unchanged but clamps everything else to zero. The ability of the neural net to describe and learn non-linear relationships and cause-effect patterns comes from combining many such neurons, where the non-linearity emerges from the kink at zero (the flat negative part of the ReLU). During training, the "best" parameters (or weights) are learned so as to minimize a cost function, see also:
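To make the "filter" idea concrete, here is a minimal sketch (plain Python/NumPy, all names my own) of ReLU and of how combining just two ReLU units already produces a non-linear, piecewise shape:

```python
import numpy as np

def relu(x):
    # Passes positive values through unchanged, clamps negatives to zero
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # negatives become 0, positives pass through

# Combining two ReLU units yields a piecewise-linear "ramp":
# zero below 0, linear between 0 and 1, constant above 1.
def ramp(x):
    return relu(x) - relu(x - 1.0)

print(ramp(np.array([-1.0, 0.5, 2.0])))  # 0, then 0.5, then saturated at 1
```

Each unit on its own is almost linear, but their combination bends the output, which is the mechanism that lets deeper nets approximate arbitrary non-linear functions.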

You are right that other non-linear activation functions can be used, such as the sigmoid you mentioned, see also this thread.

When it comes to polynomial activation functions, I guess one drawback is that their gradients vary strongly with the input, which leads to the issues described in this thread: a cubic function, for example, would tend to suffer from exploding gradients if the activation input is too large.
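A quick sketch of why this happens (plain Python, my own illustrative names): the derivative of x^3 is 3x^2, which grows without bound, while ReLU's gradient is at most 1 regardless of the input's magnitude:

```python
def cubic_grad(x):
    # d/dx of x^3 = 3x^2: grows quadratically with the input
    return 3.0 * x ** 2

def relu_grad(x):
    # d/dx of max(0, x): 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

for x in [1.0, 10.0, 100.0]:
    print(f"x={x}: cubic grad={cubic_grad(x)}, relu grad={relu_grad(x)}")
```

During back-prop these per-layer gradients are multiplied together, so factors much larger than 1 compound layer by layer, which is exactly the exploding-gradient problem.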

Regarding your question on 2nd-degree functions: besides being poorly suited due to exploding gradients (e.g. if the activation input is too large), a function like x^2 is also non-monotonic over the real numbers. A good activation function should be monotonic to support effective training with back-prop and a consistent "pull" towards the actual optimum (e.g. if the input increases, the output should increase), see also this source.
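The non-monotonicity is easy to check in a short sketch (my own illustrative code): for x^2, increasing the input on the negative side decreases the output, and the gradient 2x flips sign at zero, so the "pull" direction is inconsistent:

```python
def square(x):
    return x ** 2

def square_grad(x):
    # d/dx of x^2 = 2x: negative left of zero, positive right of it
    return 2.0 * x

# Increasing the input from -3 to -1 decreases the output: non-monotonic
print(square(-3.0), square(-1.0))  # 9.0 vs 1.0

# The gradient changes sign at zero, so the same weight update can
# push the output in opposite directions depending on the input's sign
print(square_grad(-1.0), square_grad(1.0))  # -2.0 vs 2.0
```

This is why two different inputs (e.g. -2 and 2) map to the same activation, making it harder for the network to distinguish them.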

Best regards