Decision boundary vs prediction f_wb in logistic regression

Hi,
I am struggling to understand how the decision boundary is different from the sigmoid function.
In the example
f_{\vec{w},b}(\vec{x}) = g(w_1 x_1 + w_2 x_2 + b), where w_1 = 1, w_2 = 1, b = -3,
the decision boundary is x_1 + x_2 = 3, a straight line on a graph of x_2 plotted against x_1: x_2 = -x_1 + 3. But g(z) is a sigmoid function. So how the heck are we talking about a linear equation? I am utterly missing the abstraction transition here.

If I am to calculate the probability that a point is within the boundary of f_{\vec{w},b}(\vec{x}), and this probability is expressed with the sigmoid function 1/(1+e^{-(wx+b)}), then why does such a formula apply here? Where does the 1/(1+e^{-x}) come from? I don’t know how I would calculate the probability, but I suppose that if I did, I’d end up with this formula.

Does the sigmoid function, as the probability that a point is within a certain decision boundary, apply to any prediction f_{\vec{w},b}, no matter the vectors \vec{w} and \vec{x}?

The lack of calculus in the course leaves me confused.

I would really appreciate it if you could help me.

Hi @neural_ghost ! Cheer up! It can be confusing. Let me try to explain this:

First of all, the easy starting point: the decision boundary is one thing, and the sigmoid is another thing.

How are they different?

To put it in simple words: the decision boundary is the ‘fence’ that separates one class from the other (imagine a physical fence that separates the 2 classes of samples, class A and not-class-A, one on each side of the fence). The sigmoid function is the ‘tool’ that tells you on which side of the fence a sample falls (this tool takes each sample and places it on one side or the other of this imaginary fence).
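
If it helps to see the analogy in code, here is a minimal Python sketch using the numbers from your example (w_1 = 1, w_2 = 1, b = -3); the function names are just for illustration:

```python
import math

# Weights from the example in the question: the "fence" is the line
# x1 + x2 - 3 = 0.
w1, w2, b = 1.0, 1.0, -3.0

def sigmoid(z):
    # The "tool": squashes any real z into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def which_side(x1, x2):
    z = w1 * x1 + w2 * x2 + b  # positive above the fence, negative below
    p = sigmoid(z)             # interpreted as P(class A)
    return "class A" if p >= 0.5 else "not class A"

for point in [(0.0, 0.0), (1.0, 2.0), (2.0, 2.0)]:
    print(point, "->", which_side(*point))
```

Points exactly on the fence, like (1, 2), get g(z) = 0.5 and sit right at the threshold.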

On another note, the good news is: this is not calculus but just algebra :slight_smile: so don’t despair.

@neural_ghost if you’d like to dig deeper into this, after understanding in simple terms how these 2 concepts are different, please don’t hesitate to reach out. I do recommend watching the video on the decision boundary again and following Dr Ng very closely, now that you know what he’s talking about.

Cheers,

Juan


“Linear equation” doesn’t refer to the shape you get when you plot it.

It refers to the equation being a linear combination of the features and weights.

The sigmoid() is only an activation function, which re-scales (non-linearly) the output to a range-limited set of values between 0 and 1.
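
In symbols, the two pieces are:

z = \vec{w} \cdot \vec{x} + b \quad \text{(a linear combination of the features)}

g(z) = \frac{1}{1+e^{-z}} \quad \text{(a non-linear rescaling into the range (0, 1))}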

Calculus is only used in computing the equation for the gradients of a cost function (via the partial derivative). As the math is too advanced for the target audience of this course, the derivations aren’t included. They are easy to find online, though.

Hello Adam @neural_ghost,

Agreed.

It is not linear :wink:

You can draw the line x_1 + x_2 - 3 = 0. We call the line a boundary line because it separates the space into two sides, right? And the two sides are actually

  1. anything above the line, or x_1 + x_2 - 3 > 0, and
  2. anything below the line, or x_1 + x_2 - 3 < 0

Equivalently, we say it is a boundary line that separates the (x_1, x_2) points that satisfy z > 0 from the points that satisfy z < 0.

Since these are inequalities, they describe not two lines, but two “spaces”.

We know that:

  1. when z>0, g(z) > 0.5
  2. when z<0, g(z) < 0.5
  3. when z=0, g(z) = 0.5

Therefore, if a point (x_1, x_2) stands on the boundary line, its g(z) evaluates to 0.5. If another point stands right next to the boundary line but above it, then its g(z) is slightly larger than 0.5. Furthermore, if a point stands far above the boundary line, then its g(z) is much closer to 1.

g(z) = \frac{1}{1+e^{-z}} provides a good functional form to convert an unbounded z to a range between 0 and 1. For example, as z becomes very, very large, g(z) approaches 1 and can never be larger than 1. To bound an unbounded z, we need a non-linear g(z) to do the job.
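
A quick numerical check of that bounding behaviour, as a minimal Python sketch:

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

# z = 0 sits exactly on the boundary line; large |z| saturates toward
# 0 or 1. Mathematically g(z) never reaches 0 or 1, though floating
# point rounds the extremes in the printout.
for z in [-100, -10, -1, 0, 1, 10, 100]:
    print(f"z={z:+4d}  g(z)={g(z):.6f}")
```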

Cheers,
Raymond


Hmm, alright. I thought that the equation of the sigmoid function is derived analytically by calculating the integral of z.

Ok, so my question then would be: why the sigmoid function? Yes, it fits and fulfils the need of rescaling z in a way that is useful. But how did the mathematicians come up with it for logistic regression? As in, what is the analytical explanation of the exact formula 1/(1+e^{-(wx+b)})? It is a huge deal to me to understand the basics well.

Also thank you all for answers!

Hi @neural_ghost ,

The sigmoid function outputs a value between 0 and 1, which can be interpreted as a probability. Logistic regression is a binary classifier; it cares about whether something is “true or false”. For example, if your model is a cat classifier, using sigmoid as the activation function in the output layer helps identify whether an image is “cat or not-cat”.

Raymond gave a very good explanation of g(z) = \frac{1}{1+e^{-z}}. The formula \frac{1}{1+e^{-(wx+b)}} is just setting z = wx + b.
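
Written out, that substitution is simply:

f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}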

Alright, but a similar effect could be achieved with different functions; I would suppose some combination of logarithmic and root functions. And the sigmoid function itself has many variants with similar properties that differ in their formulas. So I am wondering whether the exact formula g(z) = 1/(1+e^{-z}) is the most precise one in all circumstances, analytically.
Ok, but the properties suit the purpose and the formula is very simple, so no doubt it’s useful.

Great explanation, thank you!


@neural_ghost:

As an additional remark to the very good answers: the popularity of the logistic function among practitioners lies in its effectiveness and simplicity for classification purposes, since it is:

  • differentiable (w/ a non-negative derivative; see the identity after this list)
  • bounded
  • defined for all real numbers as input
  • providing numerical benefits in NN layers, …
  • a nice way to interpret the chosen threshold (corresponding to a probability) in combination with some other metrics:
    ROC curve – Wikipedia
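
As a worked detail for the first bullet, the derivative of the logistic function has a particularly convenient closed form,

g'(z) = g(z)\,(1 - g(z)) \ge 0

which is part of why gradient computations with the sigmoid stay cheap.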

As a side note: when fitted well, the logistic function can also serve as an “easy to compute” approximation of the integrated Gaussian probability distribution function (which describes a normally distributed feature). You might find this older article worth a read: A Sigmoid Approximation of the Standard Normal Integral:

Most probability and statistics books, […], present the normal density function with the standard normal transformation and give a tabulation of cumulative standard normal probabilities. Reference is commonly made to the fact that the probabilities are obtained by integrating the normal density function. However, because the integration of the normal density function cannot be done by elementary methods, various approximations are used to determine cumulative standard normal probabilities.

See also: logit - Is the first derivative of the logistic probability function a Gaussian function? - Cross Validated
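
If it helps to see that approximation numerically, here is a small Python sketch; note the scale factor 1.702 is a commonly quoted constant for matching the logistic curve to the standard normal CDF, not necessarily the exact fit used in the linked article:

```python
import math

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(x):
    # Standard normal CDF via the error function (no SciPy needed).
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare Phi(x) with the logistic approximation sigmoid(1.702 * x).
for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"x={x:+.1f}  Phi(x)={normal_cdf(x):.4f}  g(1.702x)={sigmoid(1.702 * x):.4f}")
```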


I don’t disagree that a similar effect could be achieved with different functions, and therefore I also wouldn’t say the sigmoid has to be the most precise one. Speaking of preciseness in the absence of prior knowledge about the problem is not quite wise. However, if you have a background in statistical mechanics or information theory, you may want to google the keywords “lagrange multiplier” and “principle of maximum entropy”, which will lead you to discussions proving that softmax (deducible to sigmoid) is the solution distribution that “maximizes the system’s entropy”, i.e. the solution that assumes no additional prior information. This does not mean softmax (or sigmoid) is the most precise answer, but it means softmax is a good default to use if you know nothing more about the system.
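
For completeness, the “deducible to sigmoid” step is just algebra: with two classes, the softmax of the scores (z_1, z_2) reduces to a sigmoid of their difference,

\text{softmax}(z_1, z_2)_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = g(z_1 - z_2)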

Some references for you; please google more for yourself:


Amazing! Thank you so much!
