It occurred to me recently that, compared to sigmoid and tanh, ReLU consists of two linear parts (x > 0 and x < 0).
Can I accept the intuition that ReLU is “more linear” than sigmoid and tanh?
If so, here is my concern: I understand that non-linear activations are needed to learn subtle, non-linear decision boundaries. ReLU, though, is exactly linear whenever x stays below 0 or above 0. Wouldn’t that mean some loss (a complete loss, in some cases?) of the non-linear component of the model?
Hi @1492r ,
I would say yes: the ReLU function is a piecewise linear activation function, while the sigmoid and tanh functions are smooth and non-linear everywhere.
However, we need to be careful with the “linearity” of ReLU, which has some downsides as well. For instance, as you mentioned, ReLU can “kill” or deactivate a neuron by setting its output to zero whenever its input is negative. This causes sparsity in the network, and if too many neurons die it can lead to underfitting.
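To make the “killing” behaviour concrete, here is a minimal NumPy sketch (illustrative only, not from any particular framework). It shows that ReLU zeros out negative inputs, and that the gradient is also zero there, which is what makes a permanently-negative neuron stop learning:

```python
import numpy as np

def relu(x):
    """ReLU: passes positive values through, zeros out negatives."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(x))  # negative inputs are "killed" to 0

# The derivative of ReLU is 0 for negative inputs, so a neuron whose
# pre-activation stays negative receives no gradient ("dying ReLU").
grad = (x > 0).astype(float)
print(grad)
```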
So, the real answer, IMHO, is that it depends on the specific details of your problem.
Let me offer a different perspective on ReLU. To begin with, I seldom think about “the degree of linearity” of ReLU. Instead, if someone asked me how good ReLU is as an activation function, a picture like the one below would come to mind.
The upper red line is the real data that I want to model, whereas the bottom green line is the output of a layer of 9 nodes with ReLU activation.
ReLU gives us a “curve” (a piecewise linear function) made of 9 line segments. It is piecewise linear because it inherits that property from ReLU itself, which is also piecewise linear. So, to me, ReLU is cool because we can approximate any curve in a piecewise linear manner, and once I see that, I stop worrying about how linear ReLU might look.
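The idea above can be sketched in code. This is an illustrative construction, not the actual model behind the figure: the target curve (`sin`) and the knot positions are my assumptions. With the weights set by hand, 9 ReLU units reproduce the piecewise-linear interpolant of the curve, i.e. a “curve” with 9 line segments:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Target curve to approximate (an assumption; the figure's real data is unknown).
g = lambda x: np.sin(2 * np.pi * x)

# 9 hidden ReLU units -> a piecewise-linear function with up to 9 kinks.
knots = np.linspace(0.0, 1.0, 10)[:-1]             # 9 knot positions on [0, 1)
h = knots[1] - knots[0]
vals = g(np.append(knots, 1.0))                    # g at the 10 grid points
slopes = np.diff(vals) / h                          # slope on each segment
coeffs = np.diff(np.concatenate([[0.0], slopes]))   # slope *change* at each knot

def net(x):
    """One hidden layer of 9 ReLU units, weights set by hand (no training)."""
    return g(0.0) + sum(a * relu(x - k) for a, k in zip(coeffs, knots))

xs = np.linspace(0.0, 1.0, 101)
err = np.max(np.abs(net(xs) - g(xs)))
print(f"max error of the 9-segment piecewise-linear fit: {err:.3f}")
```

In a real network, gradient descent would find weights playing the same role; here they are computed directly so the piecewise-linear structure is easy to see.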
You can think of the ReLU function as a kind of “filter” which passes positive numbers through but clamps everything else to zero. The neural net’s ability to describe and learn non-linear characteristics and cause–effect relationships comes from the combination of many neurons, where the non-linearity emerges from the transition into the negative part (the kink at zero) of each ReLU. During training, the “best” parameters (weights) are learned to minimize a cost function; see also this thread:
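A tiny sketch of that emergence (my own illustration): a single ReLU unit is piecewise linear, but combining just two already yields a genuinely non-linear function, since |x| = relu(x) + relu(-x):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Two ReLU "filters" pointing in opposite directions combine into |x|,
# a non-linear function neither unit can represent on its own.
def abs_net(x):
    return relu(x) + relu(-x)

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(abs_net(x))  # [3. 1. 0. 2.]
```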
I see. Thank you guys for the explanation!