Can someone please summarize the key points from this lecture video ("Why do we need non-linear activation functions?") that explain why a non-linear activation function is necessary?
We watched the video but couldn't get a good feel for which points really show that we must have a non-linear activation function.
Hi @Anbu, in my opinion the core part is around minute 2:23, when Andrew says that without a non-linear activation function there is no need for a deep network, because a composition of linear functions can be reduced to a single linear function. So the power of a deep net comes from combining linear and non-linear operations.
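To make that statement concrete, here is the algebra for a two-layer network with purely linear activations (using the standard $W^{[l]}, b^{[l]}$ notation from the course):

$$a^{[2]} = W^{[2]}\left(W^{[1]}x + b^{[1]}\right) + b^{[2]} = \underbrace{W^{[2]}W^{[1]}}_{W'}\,x + \underbrace{W^{[2]}b^{[1]} + b^{[2]}}_{b'} = W'x + b'$$

So the two layers are exactly equivalent to a single linear layer with parameters $W'$ and $b'$.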
Yeah, Andrew says that with non-linear functions in the hidden layers we get more interesting functions as outputs. But I think we're still missing the reason why non-linear or "more interesting" functions are good. Is it because in high-dimensional spaces these non-linearities favour separability? Could someone expand or provide references?
There are two levels (at least) to the answer here. The most important point is what @crisrise said in his earlier response on this thread:
The composition of linear functions is still linear. What that means is that if you don’t include non-linearity at every layer of a neural network, then there is literally no point in having multiple layers: they all collapse into one layer. Without the non-linear activation functions in every hidden layer, every neural network would be functionally equivalent to Logistic Regression, which can only do linear separation and is not nearly as powerful at classification as deep neural networks.
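As a small illustration of my own (not from the lecture), here is a minimal numpy sketch that checks this numerically: two stacked layers with identity (linear) activations produce exactly the same outputs as a single merged linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random parameters for two layers with linear (identity) activations
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))

x = rng.standard_normal((3, 5))   # 5 examples, 3 features each

# Forward pass through the two layers, no non-linearity anywhere
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2

# A single linear layer with the merged parameters
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
a_single = W_merged @ x + b_merged

print(np.allclose(a2, a_single))  # True: the extra layer bought us nothing
```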
Once you have the required non-linearity, you can add as many layers and as many neurons as needed to learn a mapping (function) that is complex enough to match the complexity of your actual data and give accurate predictions. Why would you not want the ability to learn a "more interesting" function as opposed to a "less interesting" one?
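To see what the non-linearity actually buys you, here is a hand-constructed toy example (again my own illustration, not something shown in the video): with a ReLU hidden layer, a tiny network can represent f(x) = |x|, which no single linear function can, because |x| is not linear.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hand-picked weights: the hidden layer computes [x, -x], and the output
# layer sums the two ReLU activations, so the network computes
# relu(x) + relu(-x) = |x|.
W1 = np.array([[1.0], [-1.0]])   # shape (2, 1): two hidden units, one input
b1 = np.zeros((2, 1))
W2 = np.array([[1.0, 1.0]])      # shape (1, 2): sum the hidden activations
b2 = np.zeros((1, 1))

x = np.linspace(-3, 3, 7).reshape(1, -1)      # a few test inputs
y_hat = W2 @ relu(W1 @ x + b1) + b2

print(x)      # [[-3. -2. -1.  0.  1.  2.  3.]]
print(y_hat)  # [[ 3.  2.  1.  0.  1.  2.  3.]]  -> |x|
```

Remove the ReLU and the same architecture collapses back to a straight line, as in the earlier sketch.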
Thanks a lot @paulinpaloalto for your very intuitive answer; it's all clear now. It makes sense: introducing non-linearities at every layer makes these models highly non-linear in the feature space (which, at this point in the course, is the input data space). Such highly non-linear functions can fit very complex data, even to the point of overfitting.