In “Why do we need activation functions?”, Andrew said that we can’t use a linear activation everywhere because that would just reduce to ordinary linear regression (a linear function of a linear function is still a linear function), but he also said that we can use the ReLU activation in every hidden layer. That is a little counter-intuitive to me, because ReLU seems very close to a linear activation, so I wouldn’t expect it to give a very different result.
Exactly as Tom says, a function is either linear or it’s not, and ReLU is “piecewise” linear, which is nonlinear. It might seem counterintuitive, but it works.

You can think of ReLU as the “minimalist” activation function: it’s incredibly cheap to compute and provides just the bare minimum of nonlinearity. It acts like what signal-processing people call a “high-pass filter”: it zeros out all negative values and passes the positive values through unchanged. It doesn’t always work, because returning zero for all negative values is a version of what Prof Ng will later call the “dead neuron” problem. I haven’t taken MLS, so I’m not sure whether he discusses that there, but he does in DLS.

Because of ReLU’s low compute cost, it is common to try it first as the hidden-layer activation, and in many cases it works just fine. If you don’t get good training results with it, you then try Leaky ReLU, which is almost as cheap to compute. If that also doesn’t give good results, only then do you graduate to more computationally expensive functions like tanh, sigmoid, swish and others.
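To make the “piecewise linear is still nonlinear” point concrete, here is a small NumPy sketch. It isn’t from the course materials; the function names and the 0.01 slope for Leaky ReLU are just common conventions. It defines both activations and shows that even a simple combination of two ReLU units produces a bent curve that no single linear function w*z + b can reproduce:

```python
import numpy as np

def relu(z):
    """Zero out negative inputs, pass positives through unchanged."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope (alpha) so the
    gradient is never exactly zero -- this helps with 'dead' neurons.
    The 0.01 default is a common convention, not a required value."""
    return np.where(z > 0, z, alpha * z)

# A sum of two shifted ReLU units gives a piecewise-linear curve with kinks,
# which no single linear function can match.
z = np.linspace(-2, 2, 5)          # [-2, -1, 0, 1, 2]
linear = 1.0 * z                    # a purely linear "activation"
bent = relu(z) - relu(z - 1.0)      # piecewise linear with two kinks

print(linear)   # [-2. -1.  0.  1.  2.]  -- a straight line
print(bent)     # [ 0.  0.  0.  1.  1.]  -- clearly not a straight line
```

That bent shape is exactly the kind of thing a hidden layer of ReLU units can build, and stacking layers lets the network compose many such kinks into arbitrarily complicated functions, which a stack of purely linear layers can never do.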
In addition to Tom's and Paul's excellent answers:
We recently had a thread on a similar topic which you might find interesting. Feel free to take a look!
Happy learning, @cpp219
and best regards
Christian