As we know, the ReLU function is max(0, x). I am wondering how this introduces non-linearity. For example, when ReLU receives a negative input from forward propagation, it outputs zero, making the neuron dead so that it does not contribute to the computation in the next layers. Only the positive inputs to ReLU contribute to the computation, and for those the function is linear. How is non-linearity introduced with a dead ReLU?

ReLU is a non-linear function. Its graph is not a straight line. You're right that it has the "dead neuron" problem for negative values, but it is still a non-linear function, which makes it a legitimate activation function for the hidden layers of a neural network. You can think of it as the "minimalist" activation function, because it is also the cheapest activation function in terms of CPU cost. So it is common to try it as the first possible activation when you are designing a new network to solve some specific problem. It doesn't always work well, because of the dead neuron problem, but it does work in enough cases that it is at least worth trying. If it doesn't work, then you try Leaky ReLU, which is almost as cheap to compute and does not have the "dead neuron" problem. Then if that doesn't work, you graduate to the more expensive functions like tanh, sigmoid, swish and so forth.
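To make the comparison between ReLU and Leaky ReLU concrete, here is a minimal sketch in plain Python. The function names and the negative-side slope of 0.01 are assumptions chosen for illustration, not values taken from this thread.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    """Leaky ReLU: x for positive inputs, slope * x otherwise.

    The default slope of 0.01 is an assumed, commonly used value.
    """
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    """Derivative of Leaky ReLU: never exactly zero when slope != 0."""
    return 1.0 if x > 0 else slope


for x in (-2.0, 3.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For a negative input, ReLU returns zero with zero gradient, while Leaky ReLU keeps a small non-zero gradient. That is why it avoids the dead neuron problem.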

Thank you for your response. I am trying to understand only the ReLU function in detail: how does it account for non-linearity when there are dead neurons and all active neurons give a positive output that is linear in the input?

Is the neuron considered dead after multiple iterations or epochs while training?

If we have three neurons in the hidden layer, and one neuron outputs zero in every iteration after applying ReLU while the other two output positive values in every iteration, then don't we get a linear output from this hidden layer at every iteration? Where is the non-linearity here?

ReLU units don't contribute anything when the input is negative. That makes them functionless.

The non-linearity comes from the transition between negative and positive inputs. It's not a strong non-linearity (like you'd get from sigmoid or tanh), but it exists and is enough to do the job.
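One way to see the non-linearity at the transition is to check it numerically. A linear function f must satisfy f(a + b) = f(a) + f(b), and its slope must be the same everywhere. The sketch below (the helper names are just for illustration) shows ReLU failing both checks once the inputs straddle zero.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)


def slope(f, x0, x1):
    """Average slope of f between x0 and x1."""
    return (f(x1) - f(x0)) / (x1 - x0)


# Additivity check with one negative and one positive input.
a, b = -1.0, 2.0
additive_lhs = relu(a + b)        # relu(1.0)
additive_rhs = relu(a) + relu(b)  # relu(-1.0) + relu(2.0)
print(additive_lhs, additive_rhs)

# Slope on each side of the kink at zero.
left_slope = slope(relu, -2.0, -1.0)
right_slope = slope(relu, 1.0, 2.0)
print(left_slope, right_slope)
```

The two sides of the additivity check differ (1.0 versus 2.0), and the slope jumps from 0 to 1 across zero. On either side alone the function is linear. The non-linearity lives entirely in the kink.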

You need a lot more ReLU units to make up for this weakness though.

Let me make one example of your description: x_0 is in the range of -4 to 4, all weights in the hidden layer are equal to 1, and b_0 = -10 so that the zeroth neuron always outputs zero, while b_1 = b_2 = 10 so that the first and the second neurons are always in the positive range.

In this very special case, then yes, it's just linear. You made an example that is exactly linear.
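This special case can be checked numerically with a small sketch. The hidden-layer weights and biases are the ones stated above. The output layer is not specified in this thread, so the sketch assumes it simply sums the three hidden activations (output weights of 1 and no output bias), which is a hypothetical choice.

```python
def relu(z):
    """ReLU: max(0, z)."""
    return max(0.0, z)


# Hidden layer from the example: all weights 1, b_0 = -10, b_1 = b_2 = 10.
WEIGHTS = [1.0, 1.0, 1.0]
BIASES = [-10.0, 10.0, 10.0]


def hidden(x):
    """Activations of the three hidden neurons for a scalar input x."""
    return [relu(w * x + b) for w, b in zip(WEIGHTS, BIASES)]


def output(x):
    """Assumed output layer: plain sum of the hidden activations."""
    return sum(hidden(x))


xs = [-4.0, -2.0, 0.0, 2.0, 4.0]
ys = [output(x) for x in xs]

# Slope between consecutive sample points; constant slope means linear.
slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]
print(hidden(-4.0))
print(ys)
print(slopes)
```

Over the whole range from -4 to 4 the zeroth neuron stays at zero and the other two never leave their positive region, so under this assumed output layer the network reduces to y = 2x + 20 and every slope is the same.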

But you questioned where the non-linearity is. My answer is: it lies in other examples that have a different set of weights and biases.
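As a rough illustration of such a different set, the sketch below keeps all hidden weights at 1 but uses biases of -1, 0 and 1. These bias values are hypothetical, picked only so that each neuron switches on somewhere inside the range -4 to 4. The output layer is again assumed to be a plain sum of the hidden activations.

```python
def relu(z):
    """ReLU: max(0, z)."""
    return max(0.0, z)


# Hypothetical parameters: each neuron's kink falls inside [-4, 4].
WEIGHTS = [1.0, 1.0, 1.0]
BIASES = [-1.0, 0.0, 1.0]


def output(x):
    """Assumed output layer: plain sum of the three hidden activations."""
    return sum(relu(w * x + b) for w, b in zip(WEIGHTS, BIASES))


# Sample points chosen at the ends of the range and at each kink.
xs = [-4.0, -1.0, 0.0, 1.0, 4.0]
ys = [output(x) for x in xs]

# Slope between consecutive sample points; it changes at every kink.
slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]
print(ys)
print(slopes)
```

Here the slope grows from 0 to 1 to 2 to 3 as successive neurons switch on, so the network output is piecewise linear rather than linear. Each neuron contributes one kink, which is how a sum of ReLU units can approximate a curve.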

I need you to shift the focus from your example which is linear to my example which is non-linear. Now, if we look at the output layer of my architecture again, it is the equation that contains all of the trainable parameters in the architecture, so I am going to plug in a different set of values to them:

Your example is linear, and mine is non-linear. Does the process of model training favour an example like yours, or mine? Should it favour a side because of linearity? What is the objective of model training - minimize a cost function or favour your/my example?

Please share your answers to my 2 questions, and we can continue our discussion from there.

First, thank you for such a deep explanation. Your explanation matches the idea I had in mind: the example I stated leads to linearity. But my question is exactly about that: what if such a case exists?

How will the neural network work in such a case? Will it be a linear-regressor kind of model?

And what is a dead neuron in the case of ReLU? Is a neuron declared dead when, after multiple iterations, it still gives an output of zero?

Now, coming to your questions

With the example you have given and two other sample points I selected, I can see the model is non-linear. Please find the plot I drew; I hope it is correct.

As for which side model training favours: since it works by minimizing the cost function, the outcome basically depends on the end result we desire and on the data, whether linear or non-linear.

Please advise. Also, as someone interested in this field, let me know if there are any projects that I can work on with you to learn more.

Very nice! It shows the idea. Here is a more accurate one generated by Python:

EXACTLY! EXACTLY!

The training doesn't favour any side, it just tries to match the model's outputs to the labels. That's all. If the relation between the dataset's inputs and outputs is indeed linear, then the training algorithm tries to arrive at something like your example. If they are non-linear, then my example.

It's simple. Always remember that the training algorithm tries to minimize the cost. In other words, it tries to match the model's outputs with the labels.

If the true relation between X and Y is non-linear, but the network is currently in a state of behaving linearly, then the training algorithm should move on to tune the network's weights towards a non-linear network. However, there is a chance that the network somehow gets stuck, such that the algorithm cannot tune it to being non-linear. In that case, the model will end up performing worse than we expected, and we need to train the model again from a new set of initialized weights in the hope that, this time, it will not get stuck.

"Linear regressor" is a name we humans give to a model when the model assumption is linear. Such a model can ONLY be linear. A multi-layer neural network that uses ReLU as activation is never called a linear regressor even though, after training, it may de facto be just linear.

Not really. You know the weights are updated by first calculating their gradients, right? A neuron is dead when its gradients are always zero. The gradient is the point, because if it is zero, it never learns and it is as good as dead.
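To illustrate why the gradient is the point, here is a minimal sketch of the chain rule for a single ReLU neuron a = relu(w * x + b). The helper name `neuron_grads` and the upstream gradient of 1.0 are assumptions for illustration. The parameter values reuse the earlier example (weight 1, bias -10 or 10, x between -4 and 4).

```python
def relu_grad(z):
    """Derivative of ReLU: 1 for positive z, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def neuron_grads(x, w, b, upstream):
    """Gradients of the loss with respect to w and b for a = relu(w * x + b).

    `upstream` stands for dLoss/da coming from later layers (assumed value).
    """
    z = w * x + b
    local = relu_grad(z)
    return upstream * local * x, upstream * local


xs = [-4.0, -2.0, 0.0, 2.0, 4.0]

# Bias -10: the pre-activation is negative for every x in the range.
dead = [neuron_grads(x, 1.0, -10.0, upstream=1.0) for x in xs]
# Bias +10: the pre-activation is positive for every x in the range.
alive = [neuron_grads(x, 1.0, 10.0, upstream=1.0) for x in xs]

print(all(gw == 0 and gb == 0 for gw, gb in dead))
print(alive)
```

With the bias at -10, both gradients are zero for every input in the range, so a gradient-based update leaves w and b unchanged no matter what the loss says. With the bias at 10, the gradients are non-zero and the neuron can still learn.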

Hope you will find some buddies, and the chance will be higher if you don't put your invitation in the middle of a long thread. I usually don't work on a project with a learner here, but I will be happy to read and discuss their findings in this community.