Logistic Regression vs Linear Regression in using MSE

Hi! I am loving the course. I just finished week 2’s programming assignment and a doubt crept into my head. I read a post here on Discourse that there is a “geometric interpretation” of what Logistic Regression is doing: it is finding a hyperplane in the input space. This sounds a lot like linear regression to me, where you find a line of best fit.

The confusion arises when I think of the optimal loss functions for the two. For binary classification/logistic regression, we use Binary cross-entropy, whereas for linear regression we use Mean squared error. I understand why each works for each. My question is why are they different if both are finding a line.

As I write this question, I seem to realize that in logistic regression the line is a decision boundary, whereas in linear regression it is a line of best fit, which makes them completely different problems, and I am only confused and relating them because of the name and the term “regression”? Is that understanding correct?

Side doubt (and sorry for the long post): why can logistic regression only learn linear decision boundaries if we have a non-linear activation function, i.e. sigmoid?
I appreciate any feedback. Thanks!


Hi Jaskeerat,

Your intuition, I think, is good. They are not so different: ordinary linear regression and logistic regression. Let’s tackle that first.

Consider the simple linear function y(i) = w1*x1(i) + b, where y(i) is the i-th example of the target variable and x1(i) is the corresponding (explanatory) feature value for the i-th example. With gradient descent, we are attempting to “learn” the values of w1 and b that minimize the cost (the average loss) between the predicted value (y-hat) and the observed value y. Forget about the particular loss function for now; it is not important here (though it is the MSE cost function).

Go back to the house-price example of week 1. One aim was to use the learned values of w and b to predict the value of some other house not in the training set (say, one that you may want to buy or sell) based on a number of features (e.g. sq feet, # of bedrooms). Pick one feature for simplicity (as above): size measured in square feet.

Now visualize the scatter-plot in x1-y space. The regression line is the one (defined by the values of w1 and b) that minimizes the average loss. That function, that line, is the regression line. Stick a pin in that! That is one of the lines that you are trying to keep straight in your head. The other is the decision boundary, which we haven’t talked about yet.
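To make the regression line concrete, here is a minimal sketch of learning w1 and b by gradient descent on the MSE cost for a single feature. The data values, the learning rate, and the function names `mse_cost` and `fit_line` are made up for illustration and are not from the course.

```python
# Hypothetical toy data: house size (1000s of sq ft) and price ($100k).
sizes = [1.0, 1.5, 2.0, 2.5, 3.0]
prices = [3.1, 4.4, 6.1, 7.4, 9.0]


def mse_cost(w, b, xs, ys):
    """Average squared error between predictions w*x + b and targets y."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys, lr=0.05, steps=5000):
    """Plain batch gradient descent on the MSE cost for one feature."""
    w, b, m = 0.0, 0.0, len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / m
        grad_b = 2.0 * sum(errors) / m
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = fit_line(sizes, prices)
print(f"w = {w:.3f}, b = {b:.3f}, cost = {mse_cost(w, b, sizes, prices):.4f}")
```

The learned (w, b) pair is the regression line; everything said about it in this post applies to that pair.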
Now let’s add another feature in addition to house size (x1). Say, number of bedrooms (x2). Now we have:

y(i) = w1 * x1(i) + w2 * x2(i) + b

Gradient descent now learns w1, w2, and b by minimizing the average loss (as before). But now the regression “line” is a two-dimensional object living in a three-dimensional space, so it is visualized as a plane. Conceptually though, it is the “line” that we stuck the pin in above.
Suppose you want to know about the values for the features that make a house very expensive, say at least $1mil. So fix y = y* where y* = $1m. You are interested in the set of (x1, x2) that predict a house will be >= y*:

(x1, x2) such that: w1x1 + w2x2 + b >= y*

So now you ask, what are the values of x1 and x2 that separate the expensive houses from the not-so-expensive ones? (This is a Silicon Valley perspective, by the way :hushed:). Set the left-hand side of the above expression equal to the right-hand side. Since y is fixed at y*, you now have an expression (a line!) in x1 and x2. Solve it for x2 and plot it in x1-x2 space. That line is your decision boundary. Points (x1, x2) that lie above that line (assuming w2 > 0) predict an “expensive house”; those below, a not-so-expensive house. So there’s that other line that has been troubling you. Mathematically, it defines a level set.
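As a rough sketch of that level set, the snippet below solves the boundary equation for x2 and classifies a couple of points. The weights, bias, and units are hypothetical values chosen only for illustration, and the bias b is included in the prediction.

```python
# Hypothetical learned parameters (illustrative values, not from the course).
# Price is in $ millions, x1 in 1000s of sq ft, x2 in number of bedrooms.
w1, w2, b = 0.3, 0.1, 0.05
y_star = 1.0  # "expensive" threshold: $1M


def predict_price(x1, x2):
    """Linear regression prediction for one house."""
    return w1 * x1 + w2 * x2 + b


def boundary_x2(x1):
    """Solve w1*x1 + w2*x2 + b = y_star for x2 (requires w2 != 0)."""
    return (y_star - b - w1 * x1) / w2


def is_expensive(x1, x2):
    """True when the predicted price is at or above the threshold."""
    return predict_price(x1, x2) >= y_star


print(boundary_x2(2.0))        # x2 on the boundary when x1 = 2.0
print(is_expensive(3.0, 2.0))  # a point on the "expensive" side
print(is_expensive(1.0, 2.0))  # a point on the other side
```

Because w2 is positive here, points with x2 above `boundary_x2(x1)` fall on the expensive side; with a negative w2 the sides would swap.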

Note well: We have not even talked about logistic regression yet.

Exercise: Describe the decision “boundary” in the simple regression (one feature) case.

Exercise: Suppose that you wish to use a regression equation that predicts the proportion of voters who will vote for one of two candidates for President in a national election. Note that y must be contained in the interval [0, 1]. If the number of voters is so large as to be essentially infinite, the range of y is (0, 1). Majority rule applies: A candidate must achieve at least y = 0.5 to win, i.e. y* = 0.5. I propose the following linear regression model (so, no new concepts!):

log((1-y)/y) = w x + b

Try the following: (1) Explain why this is simply a linear regression model. Hint: A data series z = log((1-y)/y) could be computed. (2) Describe/interpret the argument to the natural log function (It helps if you go to the horse races and think about y as a probability). (3) Solve in terms of y. What do you have? (4) Think about how your manipulations in solving for y might have changed (transformed) the nature of the MSE cost function. (5) What might your dataset look like? Identify two appropriate features and compute the decision boundary in (x1, x2) space.

Lastly, it’s OK to think about y as a probability (y=p).


@kenb Hey, thanks for your response. Your questions are really making me think. I’d appreciate it if you checked my understanding again.

Exercise1: Describing Decision boundary in one feature case considering the same house example and x being size.
Decision Boundary will be a ‘point’ on the x line which solves the equation wx+b=y* where y* is 1m$.
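A tiny sketch of that one-feature case, assuming hypothetical values for w and b (the helper name `boundary_point` is also made up):

```python
def boundary_point(w, b, y_star):
    """Solve w*x + b = y_star for x; a boundary point exists only if w != 0."""
    if w == 0:
        raise ValueError("w must be nonzero for a boundary point to exist")
    return (y_star - b) / w


# Hypothetical values: price in $ millions, size in 1000s of sq ft.
x_star = boundary_point(w=0.4, b=0.1, y_star=1.0)
print(x_star)  # houses larger than this are predicted to cost at least y_star
```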

Exercise 2:
Let x(one feature affecting proportion of votes) be money invested in campaigning. Then with a linear regression model, I can say wx+b=y where y is proportion, but since I want to contain it between 0 and 1, I apply sigmoid giving me sigmoid(wx+b)=y.
Using properties of log and exponents I derived: -log((1-y)/y)=wx+b, but I can take the negative to the other side and absorb it into w and b because they can be anything, giving me log((1-y)/y)=wx+b. Hence it’s simply a linear regression model to which we have applied sigmoid to scale y between 0 and 1.
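For reference, the algebra behind that step can be written out as follows (a worked sketch, with z standing for wx + b):

```latex
\begin{aligned}
y &= \frac{1}{1 + e^{-z}}, \qquad z = wx + b \\
\frac{1}{y} &= 1 + e^{-z} \\
\frac{1 - y}{y} &= e^{-z} \\
-\log\left(\frac{1 - y}{y}\right) &= z = wx + b
\end{aligned}
```

Replacing (w, b) by (-w, -b) gives log((1-y)/y) = wx + b, the form used in the exercise; in that parametrization y = sigmoid(-(wx + b)).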

Okay so basically linear regression is y=mx+b and logistic regression is y=sigmoid(mx+b), that’s the only difference in the functions?

Now I am thinking about what in this transformation affected the MSE function.
But I can’t seem to work out mathematically why applying sigmoid should suddenly make MSE a bad choice. Is it because the derivative of y=wx+b, i.e. linear regression, is simply w, but for y=sigmoid(wx+b) you get sigmoid(wx+b)*(1-sigmoid(wx+b))*w? When you set those to 0 to find the maxima and minima, for linear regression there is just one value for the weights, but for logistic regression, with the complex derivative, you can get a derivative of 0 for multiple combinations of weights?

Also, a follow-up question: if I now think of decision boundaries in logistic regression, consider Week 2, where we programmed logistic regression to classify ‘cat’ or ‘not cat’. That had 64 * 64 * 3 input features. The decision boundary would be 0.5=sigmoid(w1x1+w2x2…+b), which is nothing but log(1) = w1x1 + w2x2 + … + wnxn + b (where n = 64 * 64 * 3), which is nothing but, well, a line/hyperplane in the input space of 64 * 64 * 3 dimensions.
Is this understanding correct?
And if it is correct, how could there possibly exist a relation like this able to classify a cat or not a cat and how were we able to get this kind of a model to perform such a classification.

I think you are doing great, but let’s see where we are. First off, correct! The decision boundary would be a point on the real line in the single-feature case in exactly the manner you suggest (for simple regression).
For the rest of it, I think you have it but let’s make sure. First, we’ll agree on the notation

z = wx + b

for a linear regression for the non-vectorized (i.e., single-example but not necessarily single-feature) expression. Typing out a transpose is difficult but here is the nice typeset expression:

$z = w^T x + b$


Ok, onward. Suppose that

$z = -\log\left(\frac{1 - y}{y}\right)$


Note well: We are only pretending that y is a continuous variable between 0 and 1, like a probability. In logistic regression, y is instead discrete: it belongs to {0, 1}, i.e. {non-cat, cat}.

If we think of y as a probability, the argument to the log function in the latter expression is called the odds-ratio. Example: The money placed down at the race track implies that Seabiscuit has a 25% chance of winning the race (y = .25). The “odds” on Seabiscuit winning are “3-to-1”, since (1 - y)/y = 0.75/0.25 = 3. Every dollar bet returns 3 dollars.

As you properly figured out:

$y = \frac{1}{1 + e^{-z}} = \mathrm{sigmoid}(z)$


Looks familiar, right? Now you can mentally finesse these relationships for some intuition behind the sigmoid function. My hint/exercise suggests that (for example) if we had the historical data on the proportion of the U.S. major two-party vote going to each party’s candidate, then we could form the data series z as the negative of the log of the odds-ratio and run the regression (learn the weights and the bias). (If I had to choose just one x feature, it would be the year-over-year percent change in real disposable income in the year prior to the election. Works well! I like your idea of campaign dollars too!)
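Here is a minimal sketch of that recipe: transform the vote shares into z, fit an ordinary least-squares line, then map predictions back through the sigmoid. The numbers are invented purely for illustration (they are not real election data), and the helper names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neg_log_odds(y):
    """z = -log((1 - y) / y), defined only for 0 < y < 1."""
    return -math.log((1.0 - y) / y)


def least_squares(xs, zs):
    """Closed-form simple linear regression of z on x (minimizes MSE)."""
    m = len(xs)
    x_bar, z_bar = sum(xs) / m, sum(zs) / m
    covariance = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
    variance = sum((x - x_bar) ** 2 for x in xs)
    w = covariance / variance
    return w, z_bar - w * x_bar


# Made-up data: income growth (%) and incumbent-party vote share.
growth = [-1.0, 0.5, 1.5, 2.5, 4.0]
share = [0.44, 0.48, 0.51, 0.54, 0.59]

z = [neg_log_odds(y) for y in share]          # the transformed target series
w, b = least_squares(growth, z)               # plain linear regression on z
predicted = [sigmoid(w * x + b) for x in growth]
x_boundary = -b / w                           # where the predicted share is 0.5

print(w, b, x_boundary)
```

Note that the regression itself never sees the sigmoid; it only appears when converting the fitted z back into a share.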

Takeaway: Logistic regression has a simple regression basis if you think of y as a probability-like quantity.

Difficulties arise when y is not like that and can only take on the value 0 or 1 (binary classification!). As you see, the log of the odds-ratio is then not well-defined, and using MSE loss is not a great idea (due to nonconvexities). It is way more than we need, but the exercise of deriving the sigmoid function from the negative of the log odds-ratio was meant to suggest insight into how the binary cross-entropy cost function may arise, one that properly addresses the nonconvexities that arise in classification problems. I recommend the optional video on that as well.
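One way to see the nonconvexity numerically is to look at the curvature of each loss as a function of a single weight w, for one training example. This is only a sketch under simplifying assumptions: one example with x = 1 and label y = 1, no bias term, and a finite-difference estimate of the second derivative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_loss(w, x=1.0, y=1.0):
    """Squared error of the sigmoid prediction for a single example."""
    return (sigmoid(w * x) - y) ** 2


def cross_entropy_loss(w, x=1.0, y=1.0):
    """Binary cross-entropy of the sigmoid prediction for a single example."""
    p = sigmoid(w * x)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def curvature(f, w, h=1e-3):
    """Central finite-difference estimate of the second derivative of f at w."""
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / (h * h)


for w in (-3.0, 0.0, 3.0):
    print(w, curvature(squared_loss, w), curvature(cross_entropy_loss, w))
```

The squared loss has negative curvature at w = -3 (so it is not convex in w), while the cross-entropy curvature stays positive at all three points.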

As for your follow up, yes, the decision boundary satisfies 0.5=sigmoid(w1x1+w2x2…+b). And yes, the examples live in a very large-dimensional space. I am not sure whether the boundary is linear (a hyperplane) or a more general manifold. My guess is the latter. I am not following you on the log(1)=… part. Taking logs of both sides of the sigmoid? I would need further clarification.


@kenb Thanks for being so supportive and explaining so many things. It is genuinely helping me so much, and I’m feeling more passionate about this subject already.

Coming to my follow up, 0.5=sigmoid(w1x1+w2x2…+b), I used the property which we derived earlier:
y=sigmoid(z) implies z=-log((1-y)/y), giving me w1x1+w2x2…+b=-log((1-0.5)/0.5)
Hence -log(1)=0=w1x1+w2x2…+b
The reason I said that the boundary is linear(hyperplane) is because log(1) is a constant.
Therefore the equation is w1x1+w2x2…+b=constant, which to my knowledge is analogous to a plane equation for higher dimensions?
Do let me know if I have made some super silly mistake.

@kenb I did see the optional video on the derivation of binary cross-entropy loss from the Bernoulli distribution by maximum likelihood estimation, but is my reasoning about the derivative of sigmoid(w^t x+b) being complex and resulting in multiple 0 points also correct?

Great, you have been through the video and know how the binary-cross entropy arises via maximum likelihood. And MLE has many desirable properties. That’s what you need to understand, and you do!
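As a quick numerical check of that connection, the sketch below compares the binary cross-entropy cost with the average negative log-likelihood of independent Bernoulli outcomes. The labels and predicted probabilities are made up for illustration.

```python
import math


def bernoulli_likelihood(y, p):
    """P(y | p) for y in {0, 1}: p**y * (1 - p)**(1 - y)."""
    return (p ** y) * ((1.0 - p) ** (1 - y))


def binary_cross_entropy(ys, ps):
    """Average binary cross-entropy over the examples."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                for y, p in zip(ys, ps))
    return -total / len(ys)


labels = [1, 0, 1, 1, 0]            # made-up labels
probs = [0.9, 0.2, 0.7, 0.6, 0.1]   # made-up predicted probabilities

# Independence means the joint likelihood is the product over examples.
likelihood = math.prod(bernoulli_likelihood(y, p) for y, p in zip(labels, probs))
avg_neg_log_likelihood = -math.log(likelihood) / len(labels)

print(avg_neg_log_likelihood, binary_cross_entropy(labels, probs))
```

The two printed numbers agree up to floating-point error, so minimizing the cross-entropy cost is the same as maximizing the Bernoulli likelihood.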

Just to be sure that we are on the same page, I will go back to the simple regression formulation of the logistic model we discussed earlier:

$-\log\left(\frac{1 - y}{y}\right) = w^T x + b$


But remember, this expression is only valid for y’s strictly inside the unit interval. For example, y = 0.5, which defines the boundary between “successes” (y=1) and “failures” (y=0). So, repeating your calculation we get

$w^T x + b = -\log\left(\frac{1 - 0.5}{0.5}\right) = -\log(1) = 0$


Which means your decision boundary is the (m-1) dimensional hyperplane defined by $w^T x + b = 0$.

If the data is strictly binary, y=0 or y=1, we cannot even form the left-hand side data vector (like we can if the y’s are actually probabilities, i.e. frequencies) because the log is either infinite or ill-defined. With respect to the latter, you correctly point out that they have complex roots–Yuck. We do not even have to consider the gradient of such a monster.
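A small sketch of why the left-hand side cannot be formed for strictly binary labels: the log of the odds is fine for y strictly between 0 and 1, but fails at the endpoints. The helper names are hypothetical, and the failures are reported by exception name rather than raised.

```python
import math


def log_odds_against(y):
    """log((1 - y) / y); only defined for 0 < y < 1."""
    return math.log((1.0 - y) / y)


def try_log_odds(y):
    """Return the log-odds, or the name of the error raised for this y."""
    try:
        return log_odds_against(y)
    except (ValueError, ZeroDivisionError) as err:
        return type(err).__name__


results = {y: try_log_odds(y) for y in (0, 0.25, 0.5, 1)}
print(results)
```

For y = 0 the ratio divides by zero, and for y = 1 the ratio is 0, whose log is undefined, so neither label can be turned into a finite z.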

How are we doing?

@kenb Won’t the decision boundary still be at y=0.5 even if the data (I am assuming you meant the supervised-learning labels, which would be exactly 0 or 1) is strictly binary? The model predicts 1 when its output is >= 0.5, hence 0.5 is the deciding point, and plugging it in gives us the m-1 dimensional hyperplane?

So in the cat, not cat model, we did have this m-1 dimension hyperplane correct?
Is it right to say logistic regression always has a hyperplane decision boundary as a result?

Correction! (n_x - 1)-dimensional! Number of features, not number of examples. My bad!

And yes, you are right. Pr(y=1 | x) = 0.5 defines the decision boundary in either interpretation of the model. One must always be careful when saying always :smile:, but for all practical purposes (ones that we will encounter), yes, it is a hyperplane.

It’s a hyperplane because z is a linear (affine, more exactly) function of the features.
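A short sketch of that point: thresholding the sigmoid output at 0.5 gives the same labels as checking the sign of the affine function z, so the boundary is the set where z = 0. The weights, bias, and test points are random values generated only for this check.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(w, b, x):
    """z = w^T x + b for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


random.seed(0)
n_x = 5  # number of features (arbitrary choice for the demo)
w = [random.uniform(-1, 1) for _ in range(n_x)]
b = random.uniform(-1, 1)
points = [[random.uniform(-3, 3) for _ in range(n_x)] for _ in range(1000)]

by_probability = [sigmoid(affine(w, b, x)) >= 0.5 for x in points]
by_sign = [affine(w, b, x) >= 0.0 for x in points]
agree = by_probability == by_sign

print(agree)
```

The sigmoid is nonlinear, but it is monotonic, so it never bends the boundary; it only relabels the level sets of z.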

@kenb Ah yes. I didn’t give the m-1 a second thought either. But yay, I feel like I truly understand this concept now. This discussion was everything and more than what I hoped for when I asked the question. I think more than the course I am going to love this supportive community. Do you still have access to the course lectures and discourse after completion to reach out for more clarifications? Either way, genuinely thank you so much. I leave this discussion a lot more confident and comfortable.

:+1: Yes, we all have access to the course materials. I am looking forward to having you progress through the specialization! Cheers, Ken


One last thing to close out this topic. Since you have digested the derivation of the cost function from the Bernoulli distribution via maximum likelihood, you can do the same for the MSE cost function of the basic linear regression model. Keep in mind that the linear regression model can be applied to situations for which the target variable y is a continuous real-valued random variable, i.e. it has values in (-inf, +inf), unlike that for binary classification via logistic regression.

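The key assumption is that y is normally distributed. The normal distribution is completely characterized by the mean (mu) and variance (sigma^2):

$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$

where $y \in (-\infty, +\infty)$.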

In the derivation, recognize that the errors,

$\epsilon^{(i)} = y^{(i)} - (w^T x^{(i)} + b)$

are normally distributed with mean zero and variance sigma^2. If you have a course or two in prob/stat under your belt, then you will recognize the following notation:

$\epsilon^{(i)} \sim N(0, \sigma^2)$

If not, no worries, it’s just shorthand.

Finally, we assume that the y’s are not only identically normally distributed, but independent as well. With that information, you can re-watch the video and follow along in the derivation of the log-likelihood function. You will discover that the cost function is the mean-square error (MSE) function:

$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - (w^T x^{(i)} + b)\right)^2$


Takeaway: The likelihood principle is what binds the two different cost functions.
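For anyone who wants the steps spelled out, here is a sketch of the derivation, assuming independent examples, Gaussian errors with a fixed variance sigma^2, and m training examples:

```latex
\begin{aligned}
p\left(y^{(i)} \mid x^{(i)}; w, b\right)
  &= \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\left(-\frac{\left(y^{(i)} - w^T x^{(i)} - b\right)^2}{2\sigma^2}\right) \\
\log L(w, b)
  &= \sum_{i=1}^{m} \log p\left(y^{(i)} \mid x^{(i)}; w, b\right) \\
  &= -\frac{m}{2}\log\left(2\pi\sigma^2\right)
     - \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)} - w^T x^{(i)} - b\right)^2
\end{aligned}
```

The first term and the factor 1/(2 sigma^2) do not depend on w or b, so maximizing the log-likelihood over (w, b) is the same as minimizing the sum of squared errors, and hence the MSE.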


@kenb This was really cool! Thanks! This leads me to ask: are most loss functions derived by maximum likelihood, or only when we assume the data comes from some distribution that makes sense, e.g. Bernoulli gives binary cross-entropy and normal/Gaussian gives mean squared error?

In general, no. A simple example is the mean absolute error (MAE) cost function, i.e. the mean of the absolute values of the errors. This can be applied in situations in which one has an argument for not penalizing “outliers” as heavily as MSE does. Personally, I have not yet felt compelled to use it, so I do not have a good use case for you.
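To illustrate the outlier point, the sketch below computes both costs on a small set of made-up residuals, with and without one large error.

```python
def mse(errors):
    """Mean of squared errors."""
    return sum(e * e for e in errors) / len(errors)


def mae(errors):
    """Mean of absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


clean = [1.0, -1.0, 2.0, -2.0]   # made-up residuals
with_outlier = clean + [20.0]    # the same residuals plus one outlier

print("MSE:", mse(clean), "->", mse(with_outlier))
print("MAE:", mae(clean), "->", mae(with_outlier))
```

With these numbers the single outlier multiplies the MSE by about 33 but the MAE by only about 3.5, which is the sense in which MAE penalizes outliers less heavily.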

In this specialization, you will rely almost exclusively on binary cross-entropy (cat, not-cat) and categorical cross-entropy (cat, dog, …, none of the above). The former is a special case of the latter (as you may have guessed!). More to come in the ensuing courses, so stay tuned!
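The special-case relationship can be checked directly: with two classes and a one-hot label, the categorical cross-entropy reduces to the binary one. A minimal sketch with a made-up predicted probability p (the function names are written out by hand here, not taken from any library):

```python
import math


def binary_cross_entropy(y, p):
    """Loss for one example with label y in {0, 1} and predicted P(y=1) = p."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def categorical_cross_entropy(one_hot, probs):
    """Loss for one example with a one-hot label and class probabilities."""
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs))


p = 0.8  # made-up predicted probability of the positive class
differences = []
for y in (0, 1):
    one_hot = [1 - y, y]       # class 0 = "not cat", class 1 = "cat"
    probs = [1.0 - p, p]
    differences.append(
        abs(binary_cross_entropy(y, p) - categorical_cross_entropy(one_hot, probs))
    )

print(differences)
```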

If you wish to take a quick glance at the loss functions offered “off-the-shelf” by TensorFlow (also to come), you can refer to the documentation:

https://www.tensorflow.org/api_docs/python/tf/keras/losses .

At a glance, Kullback-Leibler divergence, which measures the (relative) distance between two probability distributions, is a good candidate for maximum-likelihood concepts to show up, provided that both distributions are Gaussian. But that is pure speculation on my part! And we will not get near that in the DL Specialization.
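For what it is worth, the KL divergence between two univariate Gaussians has a standard closed form, and with equal variances it reduces to a scaled squared difference of the means, which at least resembles a squared error. A minimal sketch (the function name `kl_gaussians` is hypothetical, and this does not settle the speculation above):

```python
import math


def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) for univariate Gaussians."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)


print(kl_gaussians(0.0, 1.0, 0.0, 1.0))  # identical distributions
print(kl_gaussians(0.0, 1.0, 1.0, 1.0))  # equal variances: (mu1 - mu2)^2 / (2 sigma^2)
print(kl_gaussians(0.0, 2.0, 3.0, 2.0))  # equal variances again, larger gap
```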