Why is the sigmoid function called a probabilistic function?

The formula of naive probability is P(X) = \frac{\text{favorable events}}{\text{total events}}, which gives values in the range [0, 1].

Do we call the sigmoid a probabilistic function just because its output falls in the same range? Why?

Hi there,

You can interpret the value of the sigmoid function as a probability: e.g. if the sigmoid function returns a value of s(x) = 0.5, you can interpret this as a 50% probability.

In the example of a classification between classes A and B, this would mean a 50% probability that a certain input feature set belongs to class A.

This value of 50% can also be used, for example, as a threshold for a classification decision.
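
As a minimal sketch of this interpretation (the helper name `classify` and the class labels "A" and "B" are only illustrative assumptions), the sigmoid output can be compared against a 0.5 threshold:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z: float, threshold: float = 0.5) -> str:
    """Hypothetical helper: choose class A when the sigmoid output reaches the threshold."""
    return "A" if sigmoid(z) >= threshold else "B"

print(sigmoid(0.0))    # 0.5, read as a 50% probability of class A
print(classify(2.0))   # A
print(classify(-2.0))  # B
```

Here z stands for whatever score the model computes from the input features; the threshold of 0.5 is a common choice, not the only possible one.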

This thread might be interesting to take a look at: Decision boundary vs prediction f_wb in logistic regression - #9 by Christian_Simonis

Please let me know if this answers your question, @tbhaxor.

Best regards


Hi @tbhaxor great question!

As you can see, the sigmoid function and the naive probability have some things in common: for instance, both have a range between 0 and 1, and the closer the value is to 1, the more likely the event is to occur. So we can say that the sigmoid function represents an outcome between 0 and 1, where a value closer to 1 means the event is more likely. That’s the main reason why we call it a probabilistic function.


I knew this would come up :smile:

So based on your answer, this is also a probabilistic function

def my_function(x):
    if x == 0:
        return 0.5  # special case, avoids dividing by zero below
    return 1 / (1 + (1 / abs(x)))

If this is true, how exactly is a probability actually calculated?

Same question to @Christian_Simonis

My exact question: is any function with range [0, 1] a probabilistic function? In my view the answer should be NO. Take a normalization function: it also outputs values between 0 and 1, yet it does not give us a probability distribution; it is only used to scale the inputs.

I hope my question is clear now.
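
To make the normalization point concrete, here is a small sketch using min-max scaling (the function name `min_max_scale` and the height values are made up for illustration):

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights_cm = [150.0, 160.0, 170.0, 190.0]  # made-up data
scaled = min_max_scale(heights_cm)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The outputs lie in [0, 1], but the 1.0 for the tallest person is not a "100% probability" of anything; it is only a rescaled height.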

Hi @tbhaxor,

here is my take on this matter:

No, sin^2(x), where x is any real number, is not suitable for modelling a probability. Besides the obvious problem of its periodic structure, another point should be mentioned:

To give you some guidance: to model probabilistic characteristics we usually use a probability density function (PDF). By definition, integrating a PDF over its whole possible range must give 1. This is not the case in my sin^2(x) example, or at least such a PDF cannot be constructed in a useful manner.
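
As a short worked check of the normalization requirement (f denotes a generic PDF and n a positive integer; both symbols are introduced here only for illustration):

```latex
\int_{-\infty}^{\infty} f(x)\,dx = 1
\qquad \text{whereas} \qquad
\int_{-n\pi}^{n\pi} \sin^2(x)\,dx = n\pi \;\longrightarrow\; \infty
\quad \text{as } n \to \infty
```

So over the whole real line, sin^2(x) cannot be rescaled by any constant to integrate to 1.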

However, it is possible for a logistic / sigmoid function: at least approximately, for a normally distributed feature, this characteristic holds, as shown in the link I sent in my previous post above:

A well-fitted logistic function can also serve as an “easy to compute” approximation of the integrated Gaussian probability density function, i.e. the cumulative distribution function of a normally distributed feature.
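
A quick numerical sketch of this approximation, using only the Python standard library. The scaling constant 1.702 is an assumption taken from commonly quoted values for matching the logistic function to the standard normal CDF; it does not come from this thread:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(x: float) -> float:
    """CDF of the standard normal, i.e. the integrated Gaussian PDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

SCALE = 1.702  # assumed scaling constant, see the note above

# Compare both curves on a grid from -5 to 5 in steps of 0.01
xs = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(normal_cdf(x) - sigmoid(SCALE * x)) for x in xs)
print(max_gap)  # stays below 0.01 on this grid
```

The two curves differ by less than one percentage point everywhere on the grid, which is what makes the logistic function a cheap stand-in for the Gaussian CDF.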

If you are interested in generalising this far, I strongly suggest reading the literature carefully. Hope that helps!

Best regards

I guess you mean normalization assuming a Gaussian (normal) distribution, right?

The normal distribution is a very popular probability density function (PDF); see my previous post.

Hope that helps! Please let me know if your question is answered, @tbhaxor!

Best regards

Hello @tbhaxor,

I have a feeling that even though your question didn’t mention logistic regression at all, some of our discussions here are actually based on the assumption that we were discussing sigmoid in the context of logistic regression.

First, I can’t find any reference that calls sigmoid a “probabilistic function”, and I have never heard that name used as an alternative name for sigmoid either. Maybe it is not your intention to give sigmoid a second name, and you are instead discussing it in the context of logistic regression?

So I want to make the context explicit. I do think the right context is necessary; otherwise we fall into questions like whether any function with a range of [0, 1] is purposefully transforming something into a probability. Purpose is important here.

I am going to keep things short, but please feel free to google for more explanations.

First, when we formulate a logistic regression problem, we are trying to find a set of model parameters (let’s call it \theta) that maximizes the probability of observing the training data (keywords to google: Maximum Likelihood Estimation). For example, suppose we have a training dataset of 5 samples (y_0, y_1, y_2, y_3, y_4) whose labels are 0, 1, 1, 0, 0. Then we want to find a set of model parameters such that the probability of observing \hat{y}_0 =0, \hat{y}_1 =1, \hat{y}_2 =1, \hat{y}_3 =0, \hat{y}_4 =0 is maximized, where \hat{y} is a model’s prediction.

Given the assumption that all samples are independent of each other, we can then write down
P(\hat{y}_0 =0, \hat{y}_1 =1, \hat{y}_2 =1, \hat{y}_3 =0, \hat{y}_4 =0 | \theta) = P(\hat{y}_0 =0 |\theta) \times P(\hat{y}_1 =1 |\theta)\times P(\hat{y}_2 =1 |\theta)\times P(\hat{y}_3 =0 |\theta)\times P(\hat{y}_4 =0 |\theta)

Since it is a binary problem, we can assume that each label follows the Bernoulli distribution. The distribution has one parameter p, and by definition p is a probability; that parameter p is exactly what logistic regression is formulated to predict.
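
Written out for a single sample (the index i and the per-sample parameter p_i are notation added here for illustration), the Bernoulli distribution gives:

```latex
P(\hat{y}_i = y \mid \theta) = p_i^{\,y}\,(1 - p_i)^{1 - y},
\qquad y \in \{0, 1\}
```

For y = 1 this reduces to p_i, and for y = 0 it reduces to 1 - p_i.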

Putting things together, \theta is basically the set of trainable parameters of the logistic regression (like w_1, w_2, b if we have a 2-feature problem), and we use the sigmoid to convert the outcome of w_1x_1 + w_2x_2 + b into that p.

Since p is by definition a probability, and we use sigmoid to produce p, that is the connection between probability and sigmoid. However, if we drop the context of logistic regression, we drop that connection as well.
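
A minimal sketch of that chain, from parameters through sigmoid to a likelihood. The labels 0, 1, 1, 0, 0 are the ones from the example; the feature values, the two candidate parameter sets, and the function name `likelihood` are made up for illustration:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(theta, features, labels):
    """Product of Bernoulli probabilities of the labels, assuming independent samples."""
    w1, w2, b = theta
    total = 1.0
    for (x1, x2), y in zip(features, labels):
        p = sigmoid(w1 * x1 + w2 * x2 + b)  # Bernoulli parameter p for this sample
        total *= p if y == 1 else 1.0 - p
    return total

# Made-up 2-feature inputs; labels as in the 5-sample example
features = [(-1.0, -2.0), (2.0, 1.0), (1.5, 0.5), (-0.5, -1.0), (-2.0, 0.0)]
labels = [0, 1, 1, 0, 0]

zero_theta = (0.0, 0.0, 0.0)    # every p is 0.5, so the likelihood is 0.5 ** 5
better_theta = (1.0, 1.0, 0.0)  # hypothetical parameters that fit these samples better

print(likelihood(zero_theta, features, labels))    # 0.03125
print(likelihood(better_theta, features, labels))  # larger than 0.03125
```

Maximum likelihood estimation searches for the \theta that makes this product as large as possible; in practice the logarithm of the product is usually maximized instead, for numerical stability.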

So if you ask why sigmoid is related to probability, my answer is that the relation comes from that connection, in the context of logistic regression.



I think it’s quite apparent: e.g. if a sigmoid is used to describe population growth, it obviously does not make sense to interpret it as a probability, but rather as a curve fit…
I would be curious which other applications of sigmoid you had in mind, @rmwkwok?

(Reason being: in the discussed context of AI, I guess all applications like sigmoid activation, image segmentation, probability of a certain action or event … can be broken down to the meaning of logistic regression with a range of [0…1], where the returned value can be interpreted as a probability estimate.)

Best regards

So the context is important. Here I think you are using sigmoid like this: population = C \times sigmoid(vt), where C is a scaling constant and v is the speed of growth. Then of course it is, as you said, curve fitting (for the parameters C and v), and sigmoid has nothing to do with probability.
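
As a rough sketch of that growth-curve use (the values of C and v below are hypothetical, chosen only to show the shape):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def population(t: float, C: float, v: float) -> float:
    """population = C * sigmoid(v * t): C scales the curve, v sets the growth speed."""
    return C * sigmoid(v * t)

C = 10_000.0  # hypothetical scaling constant (the level the curve approaches)
v = 0.3       # hypothetical growth speed

for t in (-20, 0, 20):
    print(t, population(t, C, v))
```

The outputs are head counts approaching C, far outside [0, 1], so nothing here invites a probability reading; C and v would simply be fitted to observed data.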