Within the logistic regression algorithm, why do we use the output of the linear regression model, i.e., f(x)=wx+b, as the input of the sigmoid function? Why don’t we just directly input the feature itself into the sigmoid function? Can anyone explain this with intuition?

Because f(x) is a linear combination of the input features (it captures a weighted sum of the features). This linear combination reflects how much each feature contributes to the prediction. Directly inputting the features into the sigmoid function wouldn’t capture each feature’s individual contribution, or combine them into a single score, in a linear, separable way. The sigmoid function then turns this linear combination into a probability value between 0 and 1, making the result ready for binary classification.
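To make the “weighted sum” concrete, here is a minimal NumPy sketch. The weights, bias, and feature values are made-up numbers, just for illustration:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical learned parameters for 3 features
w = np.array([0.8, -1.2, 0.5])
b = 0.1

x = np.array([1.0, 0.5, 2.0])  # one example with 3 features
z = np.dot(w, x) + b           # linear combination: weighted sum + bias
p = sigmoid(z)                 # probability of the positive class
```

Each weight scales its feature (0.8 * 1.0, -1.2 * 0.5, 0.5 * 2.0), the products are summed with the bias, and only then does sigmoid turn that one number into a probability.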

Hope it helps! Feel free to ask if you need further assistance.

Thanks for your response, but I already got the exact same response from ChatGPT. Still, it doesn’t make sense to me. It’s kind of confusing to visualize mentally, like what is meant by a weighted sum of features. I mean, it’s all fuzzy until I see it in a graph or visual.

As Alireza said, the function f(x) tries to measure the effect of changes in x on y, just as we do in simple linear regression.

And since this is a linear combination of variables, with no constraints, the output can be anywhere in (-infinity, infinity). I’m side-stepping a few mathematical details here to keep it concise.

This is fine in a regression setting, where we’re trying to predict a numerical variable. But in logistic regression we’re predicting the probability of a class, so we want our result to lie within [0, 1].

Therefore, the problem we are trying to solve is to measure the effect of X on Y while making sure, at the same time, that the result lies within [0, 1]. We do this in two steps to facilitate the computation.

So, from the linear combination, we get:

f(x) = Y = wx + b

Then to get the probabilistic result, we use the sigmoid function, which maps a result in (-inf, inf) to (0,1).

sigmoid(Y) = 1 / (1 + exp(-Y))

So, as you can see, the features do end up inside the sigmoid function, just indirectly, through the linear combination.
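If it helps to see the squashing behavior numerically, here is a quick plain-Python sketch of that mapping (the sample Y values are arbitrary):

```python
import math

def sigmoid(y):
    # maps any real number in (-inf, inf) into (0, 1)
    return 1.0 / (1.0 + math.exp(-y))

# Y = wx + b can be any real number; sigmoid squashes each one
for y in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"{y:6.1f} -> {sigmoid(y):.5f}")
```

Note that sigmoid(0) = 0.5, and large positive or negative Y get pushed toward 1 and 0 respectively, which is exactly the probability-like behavior we want.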

A very good explanation by Nick!

To add: features have different effects on the outcome, so we use a linear combination (f(x) = wx + b) to capture these different contributions (e.g., some words have a greater impact on spam email detection than others). Feeding this combined value into the sigmoid function gives us the probability needed for binary classification. Directly inputting the features into the sigmoid function would lead to poor results, as there would be nothing to learn from the data (the sigmoid function itself has no learnable parameters).
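A tiny sketch of that “nothing to learn” point, with made-up toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, -1.0])  # one example with two features (toy values)

# Without weights, sigmoid(x) is completely determined by the data:
# training could never adjust anything, and we also get one number
# per feature rather than a single prediction.
fixed = sigmoid(x)

# With weights and a bias, training can move w and b so that the
# single output probability fits the labels. (Values are made up.)
w, b = np.array([0.5, 0.5]), 0.0
p = sigmoid(np.dot(w, x) + b)  # one tunable probability
```

The first version has no knobs to turn; the second gives the model w and b to learn from labeled examples.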

I’ll try to find visual examples of logistic regression if you’re still confused!

Well, does linear regression make sense to you? There you take the linear combination of the weights and input features to compute a final output, which is just a real number between -\infty and \infty, e.g. the predicted price of a house or stock price or the temperature at noon tomorrow. The weights and bias value are learned by the model based on our training data and we hope that the training works and we get good predictions.

Well, in logistic regression what we are trying to produce is a “classification” instead of an output real number. For example, does a picture contain a cat or not? So what we do is take the same linear combination we used in linear regression and convert it into the probability of a “yes” answer by feeding it to the sigmoid function. We need a single “yes/no” answer, which is why we can’t apply sigmoid to the individual features. Once we have defined the model in this way, then we train it based on our training data (e.g. pictures with “cat” and “not a cat” labels).
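To tie the whole picture together, here is a tiny end-to-end sketch: logistic regression trained by gradient descent on a toy 1-D dataset. The data, learning rate, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy 1-D data: points below 0 labeled 0, points above 0 labeled 1
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = 0.0, 0.0   # the model learns these from the training data
lr = 0.5
for _ in range(1000):
    p = sigmoid(w * X + b)          # predicted probabilities
    # gradient of the binary cross-entropy loss w.r.t. w and b
    w -= lr * np.mean((p - y) * X)
    b -= lr * np.mean(p - y)

# threshold the probability at 0.5 to get a yes/no classification
preds = (sigmoid(w * X + b) >= 0.5).astype(int)
```

The linear part (w * X + b) scores each example, sigmoid converts the score to a probability of “yes”, and thresholding at 0.5 gives the final class, exactly the two-step structure described above.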