Can logistic regression be replaced with ordinary linear regression?

Hello everyone,
Thanks for the wonderful course.
I tried applying logistic regression to some practical binary classification problems and observed that it works no better than simple linear regression in which the desired output (Y) equals either 1 or 0. After the regression coefficients are computed, I can calculate the probability by applying the sigmoid function: P = sigmoid(w · X + b).
So, my question is: what is the benefit of following the logistic regression algorithm, with its specific loss function, over "simple" linear regression?

Will be grateful for the answer.
My best regards,
Vasyl,
Kyiv, Ukraine

1 Like

Hey @vasyl.delta,
Welcome to the community. I suppose we are clear on the fact that Logistic Regression is nothing but Linear Regression + Sigmoid, at least as far as the way of predicting the target values is concerned.

So, your query boils down to "What's the advantage of using logistic regression with the logistic loss function instead of the squared error cost function?". In Week 3 of this course, you will find that Prof Andrew describes this advantage in the video entitled "Cost Function for Logistic Regression". When you use the squared error cost function with logistic regression, the cost function that you get may be non-convex, which means it can have multiple local optima; but when we use the logistic loss function, the cost function that you get is convex, i.e., it has a single optimum.
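For reference, in the course's notation, with \hat{y}^{(i)} = \sigma(w \cdot x^{(i)} + b), the squared error cost is

J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2

which is non-convex in w and b because the sigmoid sits inside the square, while the logistic cost is

J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left( 1 - y^{(i)} \right) \log \left( 1 - \hat{y}^{(i)} \right) \right]

which is convex and therefore has a single global optimum.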

Now, you definitely don't want a merely locally optimal solution, hence we stick with the logistic cost function. Coming to your practical results, it might be that in your case the difference between a locally optimal solution and the globally optimal solution is negligible, and hence you don't see any difference between the two.

I hope this makes sense. Also, feel free to share your results here, so that they can help other learners who may stumble upon this query.

Cheers,
Elemento

5 Likes

Hi @vasyl.delta

In general, logistic regression is preferred over ordinary linear regression for classification tasks where the dependent variable is binary (e.g., "yes" or "no"). This is because the logistic function is better suited to modeling binary outcomes: it maps any input value to a value between 0 and 1, which can be interpreted as a probability.

However, it is possible to use ordinary linear regression for classification tasks by treating the binary labels as numeric 0/1 targets. For example, you could use linear regression to predict the probability of a binary outcome (e.g., the probability that a customer will convert on a website), and then use a threshold value (e.g., 0.5) to classify the outcome as "yes" or "no".

While this approach may be sufficient for some classification tasks, it is generally not recommended, as it does not take into account the specific characteristics of a binary outcome. In particular, the linear regression model may produce predicted values outside the range 0 to 1, which are not meaningful as probabilities.
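To illustrate that last point, here is a minimal sketch with synthetic single-feature data (everything here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))        # one synthetic feature
y = (X[:, 0] > 0).astype(int)        # binary 0/1 target

lin = LinearRegression().fit(X, y)
print(lin.predict([[3.0], [-3.0]]))  # one prediction above 1, one below 0
```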

In summary, while it is possible to use ordinary linear regression for classification tasks, logistic regression is generally preferred because it is specifically designed for modeling binary outcomes and is more powerful and flexible in this context.

2 Likes

Thank you very much for the answer! I will revisit the mentioned lecture.
My sincere regards,
Vasyl

1 Like

Thank you very much for the answer! It is so valuable to have feedback. It would be very interesting to find an example where binary logistic regression is really more effective in comparison with "binarized" linear regression.
Vasyl

1 Like

Imagine that you are working for a company that sells products online and you want to build a model to predict whether a customer will make a purchase (i.e., a binary outcome). You have data on the following features for each customer:

  • Age
  • Income
  • Gender
  • Education level

You decide to use both linear regression and logistic regression to build models to predict the probability of a purchase based on these features.

For the linear regression model, you first train the model using the age, income, gender, and education level features. The output of the linear regression model is a continuous value that is not restricted to the range 0 to 1. To transform this value into a probability, you apply the sigmoid function to the output of the linear regression model. The resulting probability can then be used to make predictions about the likelihood of a purchase.

For the logistic regression model, you also train the model using the age, income, gender, and education level features. The output of the logistic regression model is a probability that can be used to make predictions about the likelihood of a purchase.

Now, let’s compare the outcomes of the two models:

  • For a given customer, the linear regression model might predict a continuous value of 0.6, which would be transformed into a probability of 0.65 after applying the sigmoid function. The logistic regression model might predict a probability of 0.70 for the same customer.
  • For another customer, the linear regression model might predict a continuous value of -0.3, which would be transformed into a probability of 0.43 after applying the sigmoid function. The logistic regression model might predict a probability of 0.45 for the same customer.

Overall, you can see that the outcomes of the two models are similar, but in this example the logistic regression model predicts slightly higher probabilities than the linear regression model with the sigmoid applied at the end. This is because logistic regression is trained directly on a loss designed for binary outcomes, and so it may model this type of data more effectively.
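Here is a minimal sketch of this comparison, assuming scikit-learn and fully synthetic stand-ins for the customer features (none of these numbers are real):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(18, 70, n),         # age
    rng.normal(50_000, 15_000, n),  # income
    rng.integers(0, 2, n),          # gender, encoded 0/1
    rng.integers(0, 4, n),          # education level, ordinal 0-3
])
# synthetic purchase labels, loosely driven by income
y = (X[:, 1] + rng.normal(0, 10_000, n) > 55_000).astype(int)
Xs = StandardScaler().fit_transform(X)  # same scaled features for both routes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Route 1: linear regression on the 0/1 labels, sigmoid applied afterwards
lin = LinearRegression().fit(Xs, y)
p_lin = sigmoid(lin.predict(Xs))

# Route 2: logistic regression, trained directly under the logistic loss
log_reg = LogisticRegression(max_iter=1000).fit(Xs, y)
p_log = log_reg.predict_proba(Xs)[:, 1]

print(p_lin[:3].round(2))
print(p_log[:3].round(2))
```

The two probability columns generally come out different, because the two weight vectors are fitted under different loss functions.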

1 Like

Hey @pastorsoto,
Don't you think that the difference in the outputs of the two models will be due to the difference in the cost functions only? I believe there is nothing "inherent", apart from the cost function, that makes logistic regression different from linear regression, assuming that we use the sigmoid function on the output of linear regression, as stated in the query.

I believe that if we use the sigmoid on linear regression's output, and the same cost function, either the squared error or the logistic loss, to train both models, then there should be no difference in the outputs of the two models, at least from the mathematical point of view. The way libraries implement these two models may cause certain differences, but if we keep all the factors the same, shouldn't the two models produce the same outputs?

Cheers,
Elemento

1 Like

It could be, but you also need to change other things, such as the model architecture and the optimization algorithm.

The architecture of the linear regression is:

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n

Which captures linear relationships

While the architecture of the logistic regression is:

\hat{y} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)}}

This is more complex: it passes the same linear combination through a non-linear transformation (the sigmoid).

The optimization algorithms used to minimize the loss functions can also differ. In fact, linear regression does not even need an iterative optimizer, while logistic regression is typically fitted with (stochastic) gradient descent or a similar iterative method.
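For reference, linear regression with the squared error even has the closed-form normal-equation solution

\hat{\beta} = (X^\top X)^{-1} X^\top y

whereas no analogous closed form exists for the logistic loss, so it has to be minimised iteratively.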

I think all three factors (architecture, cost function, optimization algorithm) contribute to the difference between the outputs; perhaps in some problems you will be able to replicate the results, but in others you might not.

The best way to know is to test it!

Great discussion!

Hey @pastorsoto,

Just to confirm, the only difference here is the use of the sigmoid function, isn't it? And since we are assuming that we apply the sigmoid function to the output of linear regression, don't you think there will no longer be any difference in the model architecture?

Also, as far as the optimization algorithm goes, they are interchangeable, aren't they? The choice of whether to use GD or SGD depends on us, and not on the choice of model, doesn't it?

So, don’t you think the remaining factor is the cost function only?

Cheers,
Elemento

Yes. But changing the cost function alone won't be enough to reproduce the results; you also need to match the architecture (the sigmoid transformation) and the optimization algorithm. If you match those two, then yes, the cost function is the only remaining difference; but if you change only the cost function, you still have to apply the transformation and the same optimization algorithm.

Thank you very much!
Just to seize the opportunity to ask: I do not think that we can directly apply the sigmoid function to the prediction of a linear regression.
I mean, if the linear regression produced the prediction y = -0.3, sigmoid(y) = 0.43 is not a truthful probability for the output, since y = -0.3 is quite a strong suggestion that the output is zero. Rather, we should at least transform y to a SIGNED value, e.g. 2y - 1 = -1.6, and then get sigmoid(2y - 1) ≈ 0.17.

You may take the output of your linear regression model as a feature, fit that feature to a logistic regression model, and then use the trained logistic regression model to do that transformation.
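A minimal sketch of this idea, assuming scikit-learn (in a real setting the calibrating model should be fitted on held-out predictions rather than the training data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                 # synthetic features
true_w = np.array([1.0, -0.5, 0.3, 0.0])
y = (X @ true_w + rng.normal(0, 0.5, 300) > 0).astype(int)

lin = LinearRegression().fit(X, y)
z = lin.predict(X).reshape(-1, 1)             # linear output becomes the single feature

calib = LogisticRegression().fit(z, y)        # learns how to squash z into a probability
p = calib.predict_proba(z)[:, 1]              # calibrated probabilities in (0, 1)
print(p[:5].round(2))
```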

Thank you, so interesting! I will try it. Is it common practice to use such a feature for logistic regression?

Here is my guess: it is not common to model a binary classification problem with linear regression, but if you do, one solution is to use logistic regression to model the outcome of the linear regression.

:wink:

Raymond

1 Like

As I understand it, simple (non-logistic) linear regression will have an advantage over logistic regression in the case of non-binary classification (for instance, when we have to classify into three categories). (Non-logistic) linear regression easily encompasses any number of possible "y" values, whereas for logistic regression… I do not know at the moment, but binary logistic regression does not obviously generalize to the case of more than two decisions.

Will be grateful for the comment.

My sincere regards,

Vasyl

Kyiv, Ukraine

Hi @vasyl.delta

thanks for your question!

Regarding the statement:

simple (non-logistic) linear regression will have an advantage over logistic regression in the case of non-binary classification (for instance, when we have to classify into three categories)

I think in general this statement is not true. As a counter-example, let's assume you want to build an early warning system to determine whether a piece of information is correct or not [you could also think of anomaly detection].

You could go for logistic regression, use the probability p of false information, and define thresholds to derive a multi-class decision (similar to what you would do with a linear model, as far as I understood), e.g. with the following logic (a code sketch follows the list):

  • p < 0.2 —> OK: seems to be fine
  • 0.2 \leq p < 0.7 —> suspicion: some action is recommended, like getting an expert review
  • p \geq 0.7 —> NOT OK: information seems to be wrong.
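As a minimal sketch, this decision logic in code (thresholds and labels taken from the bullets above):

```python
def triage(p: float) -> str:
    # map the predicted probability of false information to an action
    if p < 0.2:
        return "OK: seems to be fine"
    elif p < 0.7:
        return "suspicion: expert review recommended"
    else:
        return "NOT OK: information seems to be wrong"

print(triage(0.55))  # -> suspicion: expert review recommended
```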

This approach would not necessarily perform worse than a linear regression model that you would likewise use for threshold-based conclusions, right?

I would say the model suitability really depends on the business problem and how well a certain model fits a data set and can generalise on it.

Here are some threads that you might find useful to check out:

As @rmwkwok pointed out, I would also underline:

it is not common to model a binary classification problem with a linear regression

The same also applies to a multi-class problem:

Suggestion: In this case feel free to check out multi class algorithms. E.g. here you can find several really powerful models for that purpose.

Please let me know if this helps.

Все найкраще & all the best, Vasyl!

Best regards
Christian

1 Like

Thank you so much for a quick answer, Christian! Дякую :slight_smile:
I will look through the links you kindly sent.
Could you please tell me what is meant by "probability p for false information"?
From the course I understood that the logistic regression model just provides the probabilities of "class = 0" and "class = 1". In what way is the mentioned probability p connected to them?

1 Like

Hi there,

Sure: it's just a matter of definition.
What I meant was the case-by-case definition from the example in the previous post: the logistic regression output p is the probability of class 1, and class 1 was defined there as the failure case (false information). So p = 1 would semantically correspond to the model assigning a 100% probability to the failure case, which is what the right-hand side of the sigmoid plot illustrates.


Note: if you have multi-class labels, a multi-class logistic regression can also be applied, e.g.

  • if you solve a binary problem by fitting the model for each label (one-vs-rest),
  • or alternatively if the loss minimised is the multinomial loss fit across the entire probability distribution (softmax regression); see the sketch below,

see also:
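For illustration, a minimal sketch of the two strategies using scikit-learn's LogisticRegression (note: the multi_class argument is deprecated in the newest scikit-learn releases, but it makes the two options explicit):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic 3-class data, purely for illustration
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# Option 1: one-vs-rest, i.e. one binary model fitted per label
ovr = LogisticRegression(multi_class="ovr", max_iter=1000).fit(X, y)

# Option 2: a single model minimising the multinomial (softmax) loss
multi = LogisticRegression(multi_class="multinomial", max_iter=1000).fit(X, y)

print(ovr.predict_proba(X[:2]))    # per-class probabilities, normalised to sum to 1
print(multi.predict_proba(X[:2]))  # per-class probabilities straight from the softmax
```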

Best regards
Christian

Thank you for the detailed answers, Christian! Seems to be clear now.

1 Like

Dear Christian,
Regarding this point:
“if you solve a binary problem by fitting the model for each label”
Could you please make this clearer?
Suppose, I want to classify variable X into three classes: A, B or C.
Does this mean that I make three binary classification tasks:
X belongs to A or X belongs to (B+C)
X belongs to B or X belongs to (A+C)
X belongs to C or X belongs to (A+B)
and then take the maximum among the three obtained probabilities to choose a specific class?
Will be grateful to you for the answer.
Vasyl.
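For reference, the procedure described above is exactly the standard one-vs-rest scheme. A minimal sketch of it, assuming scikit-learn (the function names here are illustrative, not from any library):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ovr_fit(X, y, classes=("A", "B", "C")):
    # one binary model per class: "c" vs. "not c"
    y = np.asarray(y)
    return {c: LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
            for c in classes}

def ovr_predict(models, X):
    # stack each model's P(class = c), then take the argmax across classes
    labels = list(models.keys())
    probs = np.column_stack([models[c].predict_proba(X)[:, 1] for c in labels])
    return np.array(labels)[probs.argmax(axis=1)]

# usage: models = ovr_fit(X_train, y_train); preds = ovr_predict(models, X_test)
```

Note that the three probabilities do not have to sum to 1, since each comes from a separate binary model; taking the maximum is the usual decision rule.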