Hi,
I struggle to understand how the decision boundary is different from the sigmoid function.
In the example f_{\vec{w},b}(\vec{x}) = g(w_1 x_1 + w_2 x_2 + b) where w_1 = 1, w_2 = 1, b = -3,
the boundary is x_1 + x_2 = 3, so the decision boundary is a straight line on a graph of x_2 plotted against x_1: x_2 = -x_1 + 3. But g(z) is a sigmoid function. So how the heck are we talking about a linear equation? I am utterly missing the abstraction transition here.
If I am to calculate the probability that a point is within the boundary of f_{\vec{w},b}(\vec{x}), and this probability is expressed with the sigmoid function 1/(1+e^{-(wx+b)}), then why does that formula apply here? Where does the 1/(1+e^{-z}) come from? I don't know how I would calculate the probability, but I suppose that if I did, I'd end up with this formula.
Does the sigmoid function, as the probability that a point is within a certain decision boundary, apply to any prediction f_{\vec{w},b}, no matter the vectors w and x?
The lack of calculus in the course leaves me confused.
Hi @neural_ghost! Cheer up! It can be confusing. Let me try to explain this:
First of all, the easy starting point: the decision boundary is one thing, and the sigmoid is another thing.
How are they different?
To put it in simple words: the decision boundary is the "fence" that separates one class from the other (imagine a physical fence that separates the 2 classes of samples, class A and not-class-A, one on each side of the fence), while the sigmoid function is the "tool" that tells you whether one sample is on one side of the fence or the other (this tool takes each sample and puts it on one side or the other of this imaginary fence).
On another note, the good news is: this is not calculus but just algebra, so don't despair.
@neural_ghost, if you'd like to dig deeper on this after understanding in simple terms how these 2 concepts are different, please don't hesitate to reach out. I do recommend watching the video on the decision boundary again and following Dr Ng very closely, now that you know what he's talking about.
"Linear equation" doesn't refer to its shape when you plot it.
It refers to the equation using the linear combination of the features and weights.
The sigmoid() is only an activation function, which re-scales (non-linearly) the output to a range-limited set of values between 0 and 1.
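A minimal sketch of that re-scaling, assuming plain Python with NumPy (the sample z values are chosen just for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Unbounded inputs in, range-limited outputs out:
for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"z = {z:6.1f}  ->  g(z) = {sigmoid(z):.5f}")
# g(-10) ~ 0.00005, g(0) = 0.5, g(10) ~ 0.99995
```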
Calculus is only used in computing the equation for the gradients of a cost function (via the partial derivative). As the math is too advanced for the target audience of this course, the derivations aren't included. They are easy to find online, though.
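For reference, the end results of that derivation for the logistic cost function J (stated here without proof, matching the gradient descent updates the course uses) are:

\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)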
You can draw the line x_1 + x_2 - 3 = 0. We call the line a boundary line because it separates the space into two sides, right? And the two sides are actually
anything above the line, or x_1 + x_2 - 3 > 0, and
anything below the line, or x_1 + x_2 - 3 < 0
Equivalently, we say it is a boundary line that separates the (x_1, x_2) points that satisfy z > 0 from the points that satisfy z < 0.
Since these are inequalities, they describe not two lines but two "spaces".
We know that:
when z>0, g(z) > 0.5
when z<0, g(z) < 0.5
when z=0, g(z) = 0.5
Therefore, if a point (x_1, x_2) stands on the boundary line, its g(z) evaluates to 0.5. If another point stands right next to the boundary line but above it, then its g(z) is slightly larger than 0.5. Furthermore, if a point stands far above the boundary line, then its g(z) is much closer to 1.
g(z) = \frac{1}{1+e^{-z}} provides a good functional form to convert an unbounded z to a range between 0 and 1. For example, as z becomes very, very large, g(z) approaches 1 and can never exceed it. To bound an unbounded z, we need a non-linear g(z) to do the job.
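If it helps to see the boundary and the sigmoid working together, here is a small sketch with the thread's example w_1 = w_2 = 1, b = -3 (Python with NumPy assumed; the sample points are chosen just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, 1.0])   # w1 = 1, w2 = 1
b = -3.0                   # boundary line: x1 + x2 - 3 = 0

points = {
    "on the line  (1.5, 1.5)": np.array([1.5, 1.5]),
    "just above   (1.6, 1.5)": np.array([1.6, 1.5]),
    "far above    (4.0, 4.0)": np.array([4.0, 4.0]),
    "far below    (0.0, 0.0)": np.array([0.0, 0.0]),
}
for label, x in points.items():
    z = np.dot(w, x) + b
    print(f"{label}:  z = {z:+.2f},  g(z) = {sigmoid(z):.4f}")
# on the line: g = 0.5000; just above: g = 0.5250;
# far above: g = 0.9933; far below: g = 0.0474
```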
Hmm, alright. I thought that the equation of the sigmoid function is derived analytically by calculating the integral of z.
Ok, so my question then would be: why the sigmoid function? Yes, it fits and fulfils the need of rescaling z in a way that is useful. But how did the mathematicians come up with it for logistic regression? As in, what is the analytical explanation of the exact formula 1/(1+e^{-(wx+b)})? It is a huge deal to me to understand the basics well.
The sigmoid function outputs a value between 0 and 1 that can be interpreted as a probability. Logistic regression is a binary classifier; it cares about whether something is "true or false". For example, if your model is a cat classifier, using sigmoid as the activation function in the output layer would help to identify whether an image is "cat or not-cat".
Raymond gave a very good explanation of g(z) = 1/(1+e^{-z}). The formula 1/(1+e^{-(wx+b)}) is just setting z = wx + b.
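Written out, the substitution is simply:

f_{\vec{w},b}(\vec{x}) = g(z)\Big|_{z = \vec{w}\cdot\vec{x} + b} = \frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}}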
Alright, but a similar effect could be achieved with different functions; I would suppose some combination of logarithmic and root functions. And the sigmoid itself comes in many variants with similar properties that differ in their formulas. So I am wondering if the exact formula g(z) = 1/(1+e^{-z}) is the most precise one in all circumstances, analytically.
Ok, but the properties suit the purpose and the formula is very simple, so no doubt it's useful.
As an additional remark to the very good answers: the popularity of the logistic function among practitioners lies in its effectiveness and simplicity for classification purposes, since it is (see the small numeric check after this list):
differentiable (w/ non-negative derivative)
bounded
defined for all real numbers as input
serving w/ numerical benefits in NN layers, …
a nice way to interpret the chosen threshold (corresponding to a probability) in combination with some other metrics: ROC curve – Wikipedia
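A small numeric check of these properties, assuming Python with NumPy (the grid of z values only samples the real line, of course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-20.0, 20.0, 100_001)  # a wide sample of real inputs
g = sigmoid(z)

# Bounded: every output stays strictly inside (0, 1).
assert np.all((g > 0.0) & (g < 1.0))

# Non-negative derivative: the closed form is g'(z) = g(z) * (1 - g(z)).
assert np.all(g * (1.0 - g) >= 0.0)

# Hence monotonically non-decreasing on the sampled grid:
assert np.all(np.diff(g) >= 0.0)
print("bounded and non-decreasing on the sampled real line: OK")
```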
As a side note: when fitted well, the logistic function can also serve as an "easy to compute" approximation of the integrated Gaussian probability density function, i.e. the normal CDF (which describes a normally distributed feature). You might find this older article worth a read: A Sigmoid Approximation of the Standard Normal Integral:
Most probability and statistics books, […], present the normal density function with the standard normal transformation and give a tabulation of cumulative standard normal probabilities. Reference is commonly made to the fact that the probabilities are obtained by integrating the normal density function. However, because the integration of the normal density function cannot be done by elementary methods, various approximations are used to determine cumulative standard normal probabilities.
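As a sketch of that idea (Python with NumPy/SciPy assumed): the scaling constant 1.702 below is the classic choice from the item response theory literature, not necessarily the exact fit proposed in the article above.

```python
import numpy as np
from scipy.stats import norm

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-4.0, 4.0, 801)
approx = sigmoid(1.702 * x)   # logistic approximation
exact = norm.cdf(x)           # integrated Gaussian density (standard normal CDF)

print(f"max |error| on [-4, 4]: {np.max(np.abs(approx - exact)):.5f}")
# stays below about 0.01 over the whole range
```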
I don't disagree that a similar effect could be achieved with different functions, and therefore I also wouldn't say sigmoid has to be the most precise one. Speaking of preciseness in the absence of prior knowledge about the problem is not quite wise. However, if you have a background in statistical mechanics or information theory, you may want to google the keywords "Lagrange multiplier" and "principle of maximum entropy", which will lead you to a discussion showing that softmax (which reduces to sigmoid in the two-class case, as the sketch below illustrates) is the distribution that "maximizes the system's entropy", i.e. the solution that assumes no additional prior information. This does not mean softmax (or sigmoid) is the most precise answer, but it means softmax is a good default to use if you know nothing more about the system.
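A tiny sketch of the "softmax is deducible to sigmoid" step (Python with NumPy assumed): a two-class softmax with logits (z, 0) gives exactly sigmoid(z).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()

# e^z / (e^z + e^0) = 1 / (1 + e^-z), so the two agree:
for z in [-3.0, 0.0, 2.5]:
    p = softmax(np.array([z, 0.0]))[0]
    print(f"z = {z:+.1f}:  softmax = {p:.6f},  sigmoid = {sigmoid(z):.6f}")
```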
Some references for you; please google more for yourself: