While in the lecture videos there is explanation as to why an activation function is used (to introduce non-linearity), it does not address why non-linearity is needed, at least to the extent that I have understood the material.
Can someone please explain for a more conceptual understanding why non-linearity is needed for the learning to take place?
Thanks in advance.
If you use a linear function in every layer, you don’t benefit from the neural network architecture at all, because you end up with a single linear function, as in the screenshot. In that case you might as well skip the multi-layer network from the start and settle for a single layer with a linear function, which is exactly the same as regression in classical ML (machine learning algorithms) and saves the time and computation a neural network takes. But that is not what we want, so we use non-linear functions to compute more and more complex functions, which lets the network make better decisions and give more accurate results.
Please feel free to ask any questions.
One other argument is that linear functions (the dot products in this case) are only capable of classifying linearly separable data, whereas the data you’re training the neural net on is in most cases not linearly separable.
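To make the linear-separability point concrete, here is a minimal sketch using XOR, a standard example of data that no single straight line can separate. The weights are hand-picked for illustration (not learned), and the names `tiny_net`, `relu` and `identity` are just made up for this example.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

def identity(z):
    """A 'linear activation': returns its input unchanged."""
    return z

def tiny_net(x1, x2, activation):
    """Two hidden units plus a linear output, with hand-picked weights."""
    h1 = activation(x1 + x2)         # hidden unit 1
    h2 = activation(x1 + x2 - 1.0)   # hidden unit 2
    return h1 - 2.0 * h2             # linear output layer

xor_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_targets = [0, 1, 1, 0]

relu_outputs = [tiny_net(a, b, relu) for a, b in xor_inputs]
linear_outputs = [tiny_net(a, b, identity) for a, b in xor_inputs]

print("targets:            ", xor_targets)
print("ReLU hidden layer:  ", relu_outputs)
print("linear hidden layer:", linear_outputs)
```

With ReLU the four outputs match the XOR targets. With the identity activation the very same weights collapse to 2 - (x1 + x2), which only goes down as x1 + x2 goes up, so it cannot be low at both (0, 0) and (1, 1) while being high in between.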
Adding to what @AbdElRhaman_Fakhry said, maybe just in my own words:
There are 2 conditions that every activation function has to meet:
They have to be non-linear: This allows the neural network to represent complex features. If the activations were linear, then all the layers could be collapsed into a simple linear regression.
They have to be differentiable (at least almost everywhere; ReLU, for example, has a kink at 0): This is important because back propagation needs to differentiate the activation and pass a gradient to the previous layer, which is what allows the neural network to learn.
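As a rough sketch of why differentiability matters, consider a single neuron a = sigmoid(w*x + b). Back propagation needs the derivative of the activation in order to get the gradient with respect to w via the chain rule. The values of w, b and x below are arbitrary, and the finite-difference check is only there to confirm that the chain-rule result is right.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def neuron(w, b, x):
    """One neuron: linear step followed by the sigmoid activation."""
    return sigmoid(w * x + b)

def grad_w_chain_rule(w, b, x):
    # da/dw = sigmoid'(z) * dz/dw, with z = w*x + b and dz/dw = x
    return sigmoid_prime(w * x + b) * x

def grad_w_numeric(w, b, x, eps=1e-6):
    # Central finite difference, used only as a sanity check
    return (neuron(w + eps, b, x) - neuron(w - eps, b, x)) / (2 * eps)

w, b, x = 0.7, -0.2, 1.5   # arbitrary illustrative values
analytic = grad_w_chain_rule(w, b, x)
numeric = grad_w_numeric(w, b, x)
print(analytic, numeric)
```

The two numbers should agree to several decimal places. If the activation had no usable derivative, there would be nothing to multiply into the chain rule and no gradient to hand back to the previous layer.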
Notice that what @AbdElRhaman_Fakhry did was to show you a proof of the one dimensional version of the following theorem:
The composition of linear functions is a linear function.
That is a true statement in general (any number of dimensions). What the proof shows is that feeding the output of the first linear function into the second linear function just gives you a different linear function.
So if your neural network layers are only linear, then there is literally no point in having more than one layer: you do not get any additional complexity in the function you can represent, meaning that the network is still a linear function, just with different coefficients. So a network with only linear activations is equivalent to Linear Regression, or to Logistic Regression if you keep a sigmoid at the output layer.
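Here is a small sketch of that collapse for more than one dimension, in plain Python. The weight and bias values are arbitrary numbers chosen for illustration, and the helper names are made up. Two stacked linear layers W2(W1 x + b1) + b2 give the same output as one linear layer with weights W2 W1 and bias W2 b1 + b2.

```python
def matvec(W, x):
    """Matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product A B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Elementwise vector sum."""
    return [a + b for a, b in zip(u, v)]

# Arbitrary illustrative parameters: layer 1 maps 2 -> 3, layer 2 maps 3 -> 1
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [1.0, 0.0, -2.0]
W2 = [[2.0, 0.0, 1.0]]
b2 = [0.5]

def two_linear_layers(x):
    """Layer 2 applied to layer 1, with no activation in between."""
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The single equivalent layer
W_combined = matmul(W2, W1)            # W2 W1
b_combined = add(matvec(W2, b1), b2)   # W2 b1 + b2

def one_linear_layer(x):
    return add(matvec(W_combined, x), b_combined)

x = [4.0, -3.0]
print(two_linear_layers(x), one_linear_layer(x))
```

Both calls print the same value, so the second layer bought nothing: the stacked network is just a single linear layer with different coefficients.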
@Juan_Olano , @paulinpaloalto, @AbdElRhaman_Fakhry & @yousafe007 thank you all for your responses.
From the explanation in lecture and what you have outlined here, I understand that short of introducing non-linearity by way of the activation functions we continue to have a linear model but one that is presented in different terms.
What both @Juan_Olano and @paulinpaloalto have alluded to, though only briefly, is what I believe I want to understand further: that complexity is introduced into the model via the activation functions. What does that actually mean, though? What does that do for the model? A graphical depiction would serve best, if possible.
As before, thanks in advance.
I think you’re missing the forest for the trees here. This is not some deep or subtle point. The point is that a linear function is a straight line or a plane in higher dimensions, right? But a non-linear function can represent a “decision boundary” that is not a straight line or a hyperplane. Not straight is more complex than straight, right? And if you “compose” non-linear functions it gets even more complex.
I see. Am I correct in interpreting what you (@paulinpaloalto) have said as the activation function serving a similar purpose to that of a best fit line? In essence you are finding a function, more complex than a line or plane, to best fit your data. Is that correct? If so, I can now also make better sense of what @yousafe007 has stated.
Yes, I think that’s the right overall idea, although the activation functions do not directly define the boundary. In the case that the neural network is performing a binary classification, it’s better not to picture it as a “curve of best fit” but as the “decision boundary” that separates the “yes” answers from the “no” answers. If you use Logistic Regression, the learned weights and bias values define the hyperplane that separates the yes and no answers. With non-linearity added to the mix at each layer, the network can learn a decision boundary that is a surface with very complex twists and turns.
The other point is that every layer in the neural network consists of two steps:
1. The linear step, which is just the linear combination of the input with the weights and bias values.
2. The non-linear activation, which is applied “elementwise” to the output of step 1.
Of course that means that the only way to introduce non-linearity is through the activation functions. That’s why it is so crucial that they be non-linear.
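The two steps can be sketched in a few lines of plain Python. This is only an illustration: tanh is picked arbitrarily as the activation, the weights are made-up numbers, and the bias is set to zero so that the linear step is exactly linear (doubling the input doubles the output).

```python
import math

def linear_step(W, b, a_prev):
    """Step 1: z = W a_prev + b."""
    return [sum(w * a for w, a in zip(row, a_prev)) + bi
            for row, bi in zip(W, b)]

def activation_step(z):
    """Step 2: apply tanh to each element of z separately."""
    return [math.tanh(zi) for zi in z]

def layer_forward(W, b, a_prev):
    """One full layer: linear step, then elementwise activation."""
    return activation_step(linear_step(W, b, a_prev))

W = [[1.0, -1.0], [0.5, 2.0]]   # arbitrary illustrative weights
b = [0.0, 0.0]                  # zero bias, see the note above
x = [1.0, 0.5]
x_doubled = [2.0 * xi for xi in x]

# Step 1 alone scales with its input ...
z = linear_step(W, b, x)
z_doubled = linear_step(W, b, x_doubled)
print(z_doubled, [2.0 * zi for zi in z])

# ... but the full layer does not, because of the activation in step 2
a = layer_forward(W, b, x)
a_doubled = layer_forward(W, b, x_doubled)
print(a_doubled, [2.0 * ai for ai in a])
```

The first pair of lists is identical and the second pair is not, which shows that the activation in step 2 is the only place where the layer stops being linear.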
Thanks for the detailed explanation, @paulinpaloalto .
I now have a much better mental image of what is happening, and understanding of the purpose of each component, the parameters (W,b) and the activation function.
It’s great that it helped! Maybe the one other point to make is that the activation functions don’t directly define the decision boundary. They just give you the ability to represent an extremely complex function. Then the real function is learned by using back propagation and gradient descent, based on the choices you made for the number of layers, the number of neurons and which activation functions to use (ReLU, Leaky ReLU, tanh, swish, sigmoid …) at each layer, all driven by your actual labelled training data and the cost/loss function.