Andrew Ng comments that keeping tanh (or another activation function) within its “linear bounds” helps reduce overfitting (see screenshot below).
However, couldn’t you use many different lines to create complex boundaries using linear regression, which would cause overfitting issues similar to those a nonlinear function can cause? See example below.
In the screenshot, I basically take the overfitting sketch from the lecture (first screenshot above) and overlay multiple lines on top of the data sketch to create a complex boundary and highlight my point. I’m just trying to understand the intuition here, because we’re being told that regularization keeps the activation function within its linear bounds, but visually you can create complex boundaries using many different linear regressions and overlay them on top of any data.
hope this makes sense. thank you for helping me develop better intuition on this.
This works because regularization reduces the magnitude of the weights. If you combine that with normalizing the features, the magnitude of X*w will stay close to the origin.
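As a rough illustration of the point above, here is a minimal NumPy sketch. It assumes made-up, already-normalized features and two hypothetical weight sizes (0.01 and 2.0), and measures how far tanh(z) drifts from the straight line z in each case.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical normalized features: mean 0, standard deviation 1.
X = rng.standard_normal((1000, 5))

def max_tanh_gap(weight_scale):
    """Largest |tanh(z) - z| over the data when every weight equals weight_scale."""
    w = np.full(5, weight_scale)
    z = X @ w                      # pre-activation values
    return np.max(np.abs(np.tanh(z) - z))

small_gap = max_tanh_gap(0.01)    # small weights: z stays near the origin
large_gap = max_tanh_gap(2.0)     # large weights: z reaches the flat tails
print(small_gap, large_gap)
```

With small weights the gap is tiny, so tanh behaves almost like the identity; with large weights it is not small at all.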
Using many different linear regressions might be possible, but at the cost of much more computation.
@TMosh thanks for the reply. I’m not sure I understand yet, even after your explanation. Putting computational cost aside for the purposes of this discussion, the professor says that the reason we need activation function outputs close to the origin is to reduce model complexity (i.e. reduce variance) and get close to a linear-regression-like model. But what I’m saying is that you can still use many nodes with linear-regression activation functions and still create plenty of complexity. So I don’t quite understand his point, as he mentions nothing about computational cost in the lecture I’m referencing. Maybe that is implied, but I don’t quite understand. Maybe I’m overthinking this, and it is simply that if you use linear regressions as the activation function (no relu, tanh, etc.) you get Z outputs that are very large, so you get high variance that way, in addition to the computational cost of having data ranges that differ widely from one layer to the next.
again, really appreciate the reply but I’m not seeing the connection between this and my question. not your fault just struggling with this concept a bit.
This gets into the realm of the ReLU activation function.
Andrew uses lots of informal arguments that give an intuition about ML processes. They’re not always mathematically well-founded. He does this because high-level math (like calculus) is not a pre-requisite for these courses.
If every layer is in the linear range, then the outcome will be just one linear boundary, instead of multiple linear boundaries. Andrew explained in the Course 1 Week 3 video "Why do you need Non-Linear Activation Functions?" that we end up with just one linear formula when the activations of all layers behave linearly.
Having just one linear boundary explains why it can’t overfit.
In the above, we know the center of tanh is just linear-like but not absolutely linear, but for the sake of discussion, I have simply considered the condition of “being at the linear range” as “being equivalent to using a linear activation”.
Your picture of multiple linear boundaries can support the idea of overfitting, but a neural network doesn’t work that way. Instead, as said above, it will end up being just one linear boundary.
On the other hand, what if you have 4 linear networks producing 4 linear boundaries? Then the next question is how you are going to piece them together to make one final decision, instead of 4 boundaries making 4 decisions. If you piece them together with a fifth neural network, and that is still a linear network, you will still end up with just one linear boundary. For example, you can write down the 4 boundaries as (1) y = 3x + 1, (2) y = x, (3) … (4) … Then, if you piece them together linearly, the final outcome will just be another boring y = mx + c.
To support your idea, you need to tell us how you are going to piece the 4 boundaries together to make one decision. You might realize that piecing them together means merging them into one boundary, and that is going to be a nonlinear boundary. The act of piecing them together therefore requires some nonlinear function, which means that, to overfit, we need nonlinearity.
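A quick numerical check of this collapse, as a sketch with randomly chosen (hypothetical) weights: four linear units feeding a linear output unit give exactly the same result as a single linear formula W x + b.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical weights: 2 inputs -> 4 linear units -> 1 linear output.
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)

def two_linear_layers(x):
    """Forward pass with no non-linear activation anywhere."""
    return W2 @ (W1 @ x + b1) + b2

# The same mapping written as one linear formula.
W = W2 @ W1
b = W2 @ b1 + b2

x = rng.standard_normal(2)
print(two_linear_layers(x), W @ x + b)   # the two values agree
```

However many linear layers are stacked, the same algebra folds them into one W and one b.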
@rmwkwok This is a great explanation. Thank you. The only piece I don’t fully understand is why you need a nonlinear function to “piece them together”. Any intuition there? I guess I’m envisioning a final layer with a different output (node) for each line.
I suppose we both agree that what we want now is one final boundary for making one decision (instead of four), and that it has to be a curved boundary. OK? Given four linear boundaries (y_1, y_2, y_3, y_4), how are we going to get a curved boundary as a result?
To piece them together, all we can do is add them up (with weights), such as w_1y_1 + w_2y_2 + w_3y_3 + w_4y_4. We have to realize that this is not going to result in a curved line. What else can we do? Before we add them up, we bend each of the linear boundaries, because once they are bent, we will be adding up four already-curved boundaries, which will result in a curved boundary.
How do we bend them beforehand? Apply a non-linear activation function, for example tanh, relu, or any other non-linear function. y_1 is linear, but tanh(y_1) is curved. The only way we are going to get one curved boundary from four originally linear boundaries is to bend them first, and that’s why I said we need nonlinear functions.
What I said above is meant to emphasize that a non-linear activation function has to be involved in the final model in order to get a curved boundary.
The tanh has a linear range, and it also has two non-linear ranges. If all layers are only able to use the linear range, the final boundary is going to be linear (which prevents overfitting). However, if all layers are in the non-linear ranges, the final boundary is going to be very flexible (vulnerable to overfitting). Of course, we can have something in between.
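The "bend first, then add" argument can be checked numerically. In this sketch the first two lines are the ones from the earlier example (y = 3x + 1 and y = x); the other two lines and the mixing weights are made up purely for illustration. The second difference on an evenly spaced grid is zero for a straight line and non-zero for a curve.

```python
import numpy as np

x = np.linspace(-3, 3, 61)
slopes = np.array([3.0, 1.0, -2.0, 0.5])        # last two are hypothetical
intercepts = np.array([1.0, 0.0, 2.0, -1.0])    # last two are hypothetical
mix = np.array([0.5, -1.0, 0.3, 2.0])           # hypothetical weights w_1..w_4

lines = slopes[:, None] * x + intercepts[:, None]   # y_1..y_4, shape (4, 61)
linear_mix = mix @ lines            # w_1*y_1 + ... + w_4*y_4
bent_mix = mix @ np.tanh(lines)     # w_1*tanh(y_1) + ... + w_4*tanh(y_4)

def max_curvature(y):
    """Largest absolute second difference; zero (up to rounding) for a line."""
    return np.max(np.abs(np.diff(y, n=2)))

linear_curv = max_curvature(linear_mix)
bent_curv = max_curvature(bent_mix)
print(linear_curv, bent_curv)
```

The weighted sum of the raw lines stays a straight line, while the weighted sum of the tanh-bent lines does not.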
However, we don’t pick. We don’t pick where we are, because the training algorithm is responsible for deciding where to settle down - the linear range, the non-linear range, or their borders. Without any regularization, the training algorithm is going to drive the model all the way to the non-linear ranges and overfit itself to the training data as best as it can, because it is its “objective” to do so - to the lowest cost.
With regularization, however, the story is different. Some regularization techniques (such as L2) are good at suppressing the size of each weight, and if the weights are small, the layer is more likely to fall into the linear range of tanh! This is why, although we don’t pick (as we said above), we have a handle called regularization to achieve that indirectly.
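To see the weight-suppressing effect in isolation, here is a minimal sketch. It is not a full neural network: it assumes a single unit trained with logistic loss by plain gradient descent on a tiny hand-made dataset, and the penalty strength `lam`, learning rate and step count are arbitrary choices.

```python
import numpy as np

# Tiny hand-made, linearly separable dataset (hypothetical).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def train(lam, steps=2000, lr=0.1):
    """Gradient descent on mean logistic loss plus (lam / 2) * ||w||^2."""
    w = np.zeros(2)
    for _ in range(steps):
        margins = y * (X @ w)
        # Gradient of mean log(1 + exp(-margin)) with respect to w.
        grad_loss = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * (grad_loss + lam * w)   # lam * w is the L2 term
    return w

w_plain = train(lam=0.0)   # no regularization: weights keep growing
w_reg = train(lam=1.0)     # L2 regularization: weights stay small

z_plain = np.max(np.abs(X @ w_plain))   # largest pre-activation magnitude
z_reg = np.max(np.abs(X @ w_reg))
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg), z_plain, z_reg)
```

The regularized run ends with smaller weights and therefore smaller pre-activations, which is what moves a tanh unit away from its saturated tails.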
Yes, I knew. Though you wrote z^[1], z^[2] which are actually for different layers, I have been assuming you were thinking about different output nodes of the final hidden layer.
Activation functions with outputs close to the origin can help to reduce variance by limiting the range of values that the model can produce. This is because the activation function squashes the output values towards zero. As a result, the model will be less sensitive to small variations in the input data.
In addition, activation functions whose outputs are close to the origin can help to make the model more efficient. This is because the model will have to learn fewer parameters if the output values are already close to zero.
So, although it is possible to create a complex model with activation functions whose outputs are close to the origin, the professor's idea is that these activation functions can help to make the model more accurate and more efficient.
@rmwkwok
first, I want to thank you for these thoughtful and detailed replies. it’s the mentors that make this course great. but I’m still struggling with this fundamental concept (though I’m getting closer to understanding with your help!). Let me ask this if I may:
If the purpose of using nonlinear activation functions is largely so we can create more complex (i.e. curved) boundaries for our data, then why do we try to stay within the “linear” portion of the nonlinear activation functions? I understand one reason is to avoid exploding and vanishing gradients, but I’m having a hard time developing an intuition for why many “tiny” lines put together can’t do the job of creating a curved boundary.
I understand that one of your reasons is that it would be extremely computationally expensive, because you’d have so many more nodes representing each line segment that you’d have to stack together. But I guess I’m struggling with how these activation functions somehow do a better job of creating these boundaries, especially if we are staying within their linear range.
I do understand the intuition behind using the sigmoid or tanh to squash inputs into a 0 to 1 range or a -1 to 1 range for classification purposes. What I do not understand is how these activation functions somehow allow us to create boundaries for complex shapes more efficiently than many lines stacked up together. For example, the relu function doesn’t even have an upper bound, and if you were to take the part of it that is greater than 0 it would effectively just be a line, too, and yet somehow this function (and others like tanh and sigmoid) allows us to create curved boundaries that many stacked-together lines can’t. I feel like I am so close to getting this intuition I can “smell” it, but I’m still not there.
Oh wait! Thinking through this some more, could it just be that each node, with wx+b as the activation function input, takes a specific feature and creates a line boundary for that one feature, and then passes it into the sigmoid or tanh to create a probability that the feature is closer to being true or false for whatever that feature is being used to create a boundary for? If that’s the case then I think I might understand, but I’m not sure.
I think what would be helpful is to see how each individual node and each piece of data maps to a chart or graph in an interactive fashion, and what the boundary looks like after each iteration. I see the final boundaries after the model runs, but what I’m missing is the intuition of how each piece of data and the resulting activation outputs gradually map to a chart or graph that, say, divides red and blue concentric circles or something.
Wait, could it just be as simple as this: the introduction of non-linear activation functions like sigmoid or tanh makes the process of forming curved boundaries even more straightforward. Instead of relying solely on the combination of multiple neurons to approximate curves, each neuron can directly contribute a curved segment of the boundary due to its non-linear activation? So while many, many straight lines stacked together could form curved lines, a nonlinear activation function (even when using the mostly linear part of the function) still has a “little curve” to it, making it much more efficient than using many lines. Arghh! I’m trying here. Let me know if that intuition is spot on.
It’s not really linear, it is just more linear than the tails of the function where the gradients approach zero. The entire hidden layer activation is non-linear - otherwise it’s just linear regression and we don’t learn any new combinations of features.
I think you may be reading more into Andrew’s explanation than he intends.
At one extreme, falling completely into that (almost-)linear region, the model is going to underfit the training data. At the other extreme, the model is going to overfit the training data. These are the risks.
We don’t try to just stay within that (almost-)linear range, because that is going to underfit. However, the existence of that (almost-)linear range is what helps prevent overfitting.
You see, the regularization pushes us towards that (almost-) linear range, but the nature of optimization (overfitting to the training data) tries to pull us away from that. Those are the two forces being balanced, and hopefully, we will settle down at somewhere in-between, not either of the extremes.
Of course, all said above assumes we are dealing with a neural network that has the capacity to overfit. We won’t be talking about a network with just 2 layers and 2 neurons being trained by 100000 samples.
Here is a very visual, wonderful tool for you to get closer - the TensorFlow Playground - but how much closer will depend on you.
For example, how would you connect the following polygonal decision boundary (not exactly curved, but is essentially curved) to the configuration of the neural network (e.g. the number of layers, the number of neurons, and the choice of activation function)?
It is really a wonderful tool for building a feel and an understanding by trying different configurations yourself. It has helped me a lot!
You might want to stay with x1 and x2 as the only two inputs, as this makes sure any non-linearity only comes through the activation functions.
You might also want to play with it today for some time, establish something, and then play again on another day, and so on and so forth. In my early days, this tool kept inspiring me.
I have a very strong feeling that you are getting really close. “contribute a curved segment” is very, very close, particularly for the case of ReLU. Because ReLU is piecewise linear, the “segment” effect is very easy to visualize. Please do try the TensorFlow Playground.
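The "segment" effect of ReLU can also be seen in a few lines of NumPy. This is a sketch with three hypothetical ReLU units on a single input x: their weighted sum is a piecewise-linear curve whose slope changes only where one of the units switches on.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Three hypothetical hidden units on a 1-D input; each has its kink at x = -b/w.
w = np.array([1.0, 1.0, 1.0])
b = np.array([1.0, 0.0, -1.0])    # kinks at x = -1, 0 and 1
v = np.array([1.0, -2.0, 1.5])    # output-layer weights

x = np.linspace(-2, 2, 401)
f = relu(np.outer(x, w) + b) @ v  # network output for every x

# Slope of each small interval, rounded to hide floating-point noise.
slopes = np.round(np.diff(f) / np.diff(x), 6)
distinct_slopes = np.unique(slopes)
print(distinct_slopes)
```

Three units give four straight segments here; adding more units adds more kinks, which is how a polygonal boundary can approximate a curved one.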
Thank you again, ahh I remember seeing this tool somewhere but appreciate you reminding me about it as I think you’re right this is a perfect time to dig in some more. Thank you!!
Re: screenshot below
input layer (x1, x2): what is the right intuition behind the numerical value that the image for feature x1 or x2 represents? Would it be 0 or 1 depending on whether the line is horizontal or vertical?
hidden layer (n=1): is the linear boundary it creates supposed to be the line that represents W1x + b, or something else?
Output (n=1): what does it mean that this neuron is mostly white? Does it mean that the output is “negative” for the classification? I would expect that for a binary activation function, but this is ReLU, so I’m not sure what a near-white neuron means.
General: why is it that if I increase W manually the border contrast increases?
First, you sent the post before the image had been successfully uploaded, so I cannot see it. I am not asking for it, though; instead, I suggest you do two exercises.
(preparation step) I suggest you try a setting with 1 layer and 3 neurons. Train the model long enough that it stabilizes.
Exercise 1: Then pick a connection from one input to one neuron of the first layer, increase its weight at 0.1 per step, and see how the boundary of that neuron changes.
Exercise 2: Pick a connection from a neuron of the first layer to the output layer, increase its weight at 0.1 per step, and see how its significance in the final decision changes.
Not sure. I have not looked into the source code of the playground, but the exercise 1 might be suggestive.
Besides my two exercises, I also recommend that you brainstorm and try more yourself. My exercises are based on a very simple neural network configuration: they focus on one kind of change at a time, apply that change incrementally, and observe the tool’s responses.
Also, I use this tool to understand any relative effect due to my actions, instead of to understand the absolute meaning of everything in the tool. For example, I don’t really have the need to know how the graphs in the nodes are drawn, perhaps unless I had to reinvent the tool myself.
This is an interactive tool, so we are supposed to try things and reason through them ourselves. It would be even better if you could find someone to sit next to you to experiment, observe and discuss with you. Any intermediate action can affect what you observe at any moment, and it is often not very efficient to deliver all the necessary context in a forum.
There are also times when we just want to remember the observations even though they are not fully understood, and wait until the right time comes to finally get the answers.
This thing is supposed to take time, sometimes as much time as we promise to devote ourselves in Machine Learning.