Choice of activation function

Hi everyone! Today I learned about how to choose an activation function for the hidden layers.
Prof. Andrew mentions that using linear activations for all the hidden layers reduces the network to a logistic regression model (assuming the last layer uses the sigmoid function). He also says that the ReLU function is an efficient alternative. But I noticed that ReLU isn’t very different from the linear function if we only consider positive values of ‘x’, so how exactly is it an alternative if it’s doing the same thing?

Hi @Srivaths_Gondi

I think this will help you: “machine learning - Why is ReLU used as an activation function?” on Data Science Stack Exchange.

Please feel free to ask any questions.

The ReLU function is different from a linear function: ReLU is linear on the range $[0, \infty)$ but non-linear over the whole range $(-\infty, \infty)$. It is this non-linearity that we exploit in the hidden layers, so that the network can model much more than a straight-line relationship between input and output.
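To make this concrete, here is a minimal sketch in plain Python. The weights, biases and the helper name `forward` are made up purely for illustration; they are not from the course. With the identity as the hidden activation, the two-layer net is exactly one straight line, while swapping in ReLU breaks that equivalence.

```python
import math

def identity(z):
    return z

def relu(z):
    return max(0.0, z)

# Hypothetical parameters for a net with 1 input, 2 hidden units, 1 output.
W1 = [2.0, -1.0]   # hidden-layer weights
B1 = [0.5, 1.0]    # hidden-layer biases
W2 = [1.5, 1.0]    # output-layer weights
B2 = -0.25         # output-layer bias

def forward(x, activation):
    """Pre-sigmoid output of the tiny net for a scalar input x."""
    hidden = [activation(w * x + b) for w, b in zip(W1, B1)]
    return sum(v * h for v, h in zip(W2, hidden)) + B2

# With a linear hidden layer, everything collapses to SLOPE * x + INTERCEPT.
SLOPE = sum(v * w for v, w in zip(W2, W1))
INTERCEPT = sum(v * b for v, b in zip(W2, B1)) + B2

XS = [-2.0, -1.0, 0.0, 1.0, 2.0]
for x in XS:
    print(x, forward(x, identity), SLOPE * x + INTERCEPT, forward(x, relu))
```

The second and third printed columns agree for every x, which is why a sigmoid on top of linear hidden layers is just logistic regression. The ReLU column departs from that line for some inputs.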


Hi! I think this article will help you clarify your question.

Right! In mathematics, there is no such thing as “almost linear”: it’s either linear or it’s not. ReLU is “piecewise linear”, but that is a very different thing than “linear”.
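If it helps, the "linear or not" test can be checked directly. A linear function f must satisfy f(a + b) = f(a) + f(b) and f(c·a) = c·f(a) for all inputs. The sketch below is only illustrative (the helper names and the slope 3.0 are my own choices): ReLU passes these checks when everything stays positive, which is what the original question noticed, but fails as soon as a negative value is involved.

```python
def relu(z):
    return max(0.0, z)

def linear(z):
    # An arbitrary linear map, chosen only for comparison.
    return 3.0 * z

def is_additive(f, a, b):
    """Check f(a + b) == f(a) + f(b) for one pair of inputs."""
    return f(a + b) == f(a) + f(b)

def is_homogeneous(f, c, a):
    """Check f(c * a) == c * f(a) for one scalar c and input a."""
    return f(c * a) == c * f(a)

print(is_additive(linear, 1.0, -1.0), is_homogeneous(linear, -1.0, 1.0))
print(is_additive(relu, 1.0, 2.0))      # both inputs positive
print(is_additive(relu, 1.0, -1.0))     # mixed signs
print(is_homogeneous(relu, -1.0, 1.0))  # negative scaling
```

A single failing pair is enough to show a function is not linear, so the mixed-sign case settles it for ReLU.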

You could think of ReLU as the “minimalist” activation function: it is dirt cheap to compute and provides the most basic form of non-linearity. It doesn’t give good results in every application, because it has the “dead neuron” problem: the gradient is zero for all z < 0, so a neuron whose input is always negative stops learning. But it’s the first thing to try because of its computational efficiency. If it doesn’t work in your case, then you try Leaky ReLU, which fixes the “dead neuron” issue and is still very cheap to compute. If that doesn’t work, only then do you “graduate” to more expensive functions based on the exponential function, like tanh, sigmoid, swish and so forth.
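Here is a small sketch of the difference between the two, written from the standard definitions. The slope `alpha=0.01` for the negative side is an assumption (a commonly used value, not something specified in this thread), and the gradient at exactly z = 0 is set by convention to the negative-side value.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Zero gradient for z <= 0: this is the source of "dead" neurons.
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    # alpha is an assumed small slope for negative inputs.
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    # The gradient never vanishes, so the neuron can still recover.
    return 1.0 if z > 0 else alpha

for z in [-3.0, -0.5, 0.5, 3.0]:
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```

Both functions cost only a comparison and at most one multiplication, which is the "very cheap to compute" point made above.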


Thank you for your answer. I think for us newbies, non-linear means ‘curved’, so at the beginning it doesn’t seem like ReLU is curving. But when you explain that it’s a ‘dirt cheap’ form of non-linearity, I think I get it. Now it sounds interesting, and I’m starting to wonder who thought of ReLU first.

I haven’t really studied the history here, but the idea and usefulness of ReLU predate ML/DL by quite a bit. I know it was used in Signal Processing for a long time. If you think about it, it’s what you would call a “half-wave rectifier” (hence the “Rectified” in Rectified Linear Unit): it drops everything below a certain value and passes through the values above that. You can think of ReLU as a rectifier with the threshold = 0.

This is a pattern in a lot of ML/DL: the mathematics is not new. It’s being recycled or repurposed from earlier applications in Statistics, Physics and other fields. E.g. the ideas behind the sigmoid function and cross entropy loss go back to the 19th century. Look up Maximum Likelihood Estimation from Statistics for the history in that example. I think a lot of the early work was done by Gauss, who was one of the towering figures of late 18th and early 19th century mathematics. That’s Gauss as in Gaussian Distribution, Gaussian Elimination and many more …

And all the optimization techniques like Gradient Descent have many other earlier applications.


In addition to the very good answers above, here is one more point for illustration (it may be obvious to many, but it has helped several students dealing with ReLU for the first time):

Each neuron has its own weights and bias, followed by an activation function. Combining many such neurons (as the neural net as a whole does) allows the network to learn highly nonlinear behavior, even though the activation function of a single neuron is, as you correctly pointed out, only piecewise linear in the case of ReLU.
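As a rough illustration of that point, the sketch below adds up four ReLU units to trace a curve, here x² on [0, 1]. The knot positions and coefficients are hand-picked by me so that the sum interpolates x² at 0, 0.25, 0.5, 0.75 and 1; a trained network would have to learn such values rather than be given them.

```python
def relu(z):
    return max(0.0, z)

# Hypothetical hidden layer: each unit computes relu(x - knot).
KNOTS = [0.0, 0.25, 0.5, 0.75]
# Output weights: each one is the change in slope at the matching knot.
COEFFS = [0.25, 0.5, 0.5, 0.5]

def relu_sum(x):
    """Weighted sum of shifted ReLU units: a piecewise linear curve."""
    return sum(c * relu(x - k) for k, c in zip(KNOTS, COEFFS))

xs = [i / 100 for i in range(101)]
max_error = max(abs(relu_sum(x) - x * x) for x in xs)

print("largest gap to x^2 on the grid:", max_error)
```

With only four units the gap to the true curve stays under 0.02 on this grid, and adding more knots shrinks it further. Each unit contributes one "kink", and enough kinks can follow a curve as closely as you like.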