If we use this model, then let me ask you this: how do you know that 0.5 is the correct threshold for deciding whether the output is “yes” or “no”? If your reply is to constrain the values between 0 and 1 and then use this threshold, since it is the logical mid-point, then isn’t that exactly what the sigmoid does? And in that case, won’t your linear regression + sigmoid model essentially be the same as a logistic regression model?
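Just to make that last point concrete: since sigmoid(0) = 0.5 and the sigmoid is monotonically increasing, thresholding the sigmoid output at 0.5 is exactly the same decision rule as thresholding the raw linear output at 0. A minimal sketch (the values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Raw, unbounded outputs of a linear model (illustrative values)
z = np.array([-3.0, -0.1, 0.0, 0.4, 7.0])

# Thresholding sigmoid(z) at 0.5 ...
pred_sigmoid = (sigmoid(z) >= 0.5).astype(int)
# ... gives exactly the same labels as thresholding z at 0,
# because sigmoid(0) = 0.5 and the sigmoid preserves ordering.
pred_raw = (z >= 0.0).astype(int)

assert (pred_sigmoid == pred_raw).all()
```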
If you have any other reply to my question, then do let me know, and we will discuss it from there.
As to this, deciding the threshold becomes even more difficult. To classify a given sample into one of the 10 categories, you would have to find 9 thresholds.
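To see why, picture it with a toy sketch: a single regression output would have to be carved into 10 bins by 9 cut-points (the thresholds below are arbitrary, just for illustration), and it silently assumes the 10 categories have a meaningful order, which class labels usually don’t.

```python
import numpy as np

# 9 arbitrary cut-points carving one real-valued output into 10 classes
thresholds = np.arange(0.5, 9.5, 1.0)   # 0.5, 1.5, ..., 8.5

scores = np.array([-2.3, 0.7, 4.2, 11.0])   # raw linear outputs
classes = np.digitize(scores, thresholds)
print(classes)   # [0 1 4 9]
```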
Just for clarity, I am talking about doing linear regression only, not taking the sigmoid of the result.
And in that case, won’t your linear regression + sigmoid model essentially be the same as a logistic regression model?
Well, not in my opinion, because they back-propagate differently. But if they were the same, I wonder why we don’t use it, since it is simpler and faster than logistic regression.
I believe the problem with a linear-only function is the break point, because we wouldn’t know where the break-point value is. Is it 0, 1, 2, …?
I don’t know if training a linear model as a binary classifier, with 0-1 labels, and using 0.5 as the break point could work, though; it may be just as effective. The highest value would be infinite (not really, but a very high number), and the lowest -inf.
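For what it’s worth, this is cheap to try empirically. A rough sketch with scikit-learn on a synthetic dataset (purely illustrative, not a claim about how it behaves in general):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Ordinary least squares on the 0/1 labels, then cut at 0.5
lin = LinearRegression().fit(X_tr, y_tr)
lin_acc = ((lin.predict(X_te) >= 0.5).astype(int) == y_te).mean()

log = LogisticRegression().fit(X_tr, y_tr)
log_acc = (log.predict(X_te) == y_te).mean()

print("linear + 0.5 cutoff:", lin_acc)
print("logistic regression:", log_acc)
```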
This is only due to the difference in how the outputs are calculated, don’t you think? If we use a sigmoid function on top of the linear regression outputs, and we use the same loss function for this model and for a logistic regression model, won’t the back-propagation be essentially the same in both cases?
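For the record, this is the standard textbook derivation, nothing specific to our discussion. With $\hat{y} = \sigma(z)$ and $z = w^\top x + b$, the binary cross-entropy loss is

$$L = -\big[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\big],$$

and the chain rule with $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$ collapses to

$$\frac{\partial L}{\partial w} = (\hat{y} - y)\, x,$$

which is precisely the logistic regression gradient, since logistic regression is nothing but a linear model followed by a sigmoid, trained with this loss.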
But if you don’t constrain the values to 0 and 1, on what basis are you suggesting we use 0.5? My suggestion was: let’s constrain all the values to 0-1 and then use the mid-point. If you do not constrain the values, then won’t the logical mid-point be 0 instead of 0.5?
But we are using linear regression for classification, aren’t we? Then how can we use, say, mean-squared error to calculate the cost when the predicted label is a numerical value and the true label denotes a class, say 0 or 1? How do you propose we transform the labels into numerical values?
I am not really sure what you mean by this. Say our linear model predicts a value of 12 for a single image and the true label is 0; are you then proposing to use a mean-squared error of 144, assuming a single example?
Yes, that would be the error. You could also normalize by N, or even N squared, I guess (if the values get too large). I mean, I am just a beginner, and I am trying to think through why this wouldn’t work.
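To put a number on the blow-up: the gradient of the squared error with respect to the prediction is 2(ŷ − y), so an unbounded prediction means an unbounded gradient, not just an unbounded loss. A quick sketch with the value from above:

```python
# Squared error for one example with true label 0
y_true = 0.0
y_pred = 12.0                      # a raw, unbounded linear output

loss = (y_pred - y_true) ** 2      # 144.0
grad = 2 * (y_pred - y_true)       # 24.0, and it keeps growing with y_pred

# With a sigmoid on top, y_pred would be squashed into (0, 1),
# so the squared error could never exceed 1 for a 0/1 label.
print(loss, grad)
```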
That’s good; pushing learners to deepen their understanding is indeed one of the goals of the course. Here, check this blog out. It doesn’t compare logistic regression to linear regression like we were doing in our discussion, but it will clearly show you why we shouldn’t use linear regression for classification problems.
Once you understand that, you will see how the presence of the “sigmoid” function transforms the linear regression model into a logistic regression model, and that if you use the same loss function in both cases, the “linear regression + sigmoid” model is exactly the same as a “logistic regression” model.
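If you want to convince yourself numerically, here is a quick check (assuming PyTorch, but any autograd framework would do): a linear layer with an explicit sigmoid and BCELoss, versus the same linear layer trained as a logistic regression via BCEWithLogitsLoss, produce identical gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3)
y = torch.randint(0, 2, (8, 1)).float()

# Two copies of the same linear model
lin_a = nn.Linear(3, 1)
lin_b = nn.Linear(3, 1)
lin_b.load_state_dict(lin_a.state_dict())

# Model A: linear regression + explicit sigmoid, with BCE loss
nn.BCELoss()(torch.sigmoid(lin_a(x)), y).backward()

# Model B: logistic regression (the sigmoid is folded into the loss)
nn.BCEWithLogitsLoss()(lin_b(x), y).backward()

print(torch.allclose(lin_a.weight.grad, lin_b.weight.grad))  # True
```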
Now, as you said, we can use mean squared error as the loss function in the first case, without transforming the true labels, but there are some pitfalls in that as well, which you can check out here.
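One such pitfall, in case the link ever rots: squared error on raw linear outputs penalizes predictions that are far from the label even when they are on the correct side of the threshold, so the optimizer actively drags confident, correctly classified points back toward the boundary. A toy illustration:

```python
# True label is 1; both predictions classify correctly with a 0.5 cutoff
y = 1.0
for y_pred in (0.9, 5.0):
    loss = (y_pred - y) ** 2
    grad = 2 * (y_pred - y)
    print(y_pred, loss, grad)

# 0.9 -> loss 0.01, gradient -0.2 (barely touched)
# 5.0 -> loss 16.0, gradient  8.0 pulling the prediction back toward 1,
#        even though it was already classified correctly
```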
Indeed, they will always be there, but for continuous targets we can’t use logistic regression. For categorical targets, however, we can use logistic regression, which greatly reduces the effect of outliers. So, if you have a better model, why would you want to stick with a worse one?
Yes, that seems correct. So even though you can sometimes read that you can’t use linear regression for binary classification… it seems the right answer is that there is a better model whose optimization won’t suffer from huge gradients and the like, as you imply.
Indeed, no one is stopping you from using a regression model for any classification task, but when you know that you have better models available, i.e., classification models that are specifically designed to handle classification tasks, why would you possibly want to use a regression model?
Yes, the only reason I thought it could be useful is that, without the gradient explosion, it could be quicker to optimize (pretty much the argument made for ReLU, I think).
I believe you could still cap the maximum value of “y” at, say, 10, but I know I am maybe pushing it too much.
I find it quite stimulating to think about though.
A fun fact to note here is that you don’t know the range of values your linear model will produce. Say that for every image it produces a value larger than 10, and we capped our y values at 10; what would we do then? The fact that a linear model produces unbounded values makes it very difficult to use for classification.
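Capping doesn’t rescue it either: wherever the cap is active, the gradient is exactly zero, so those samples simply stop contributing to learning. A small check (again assuming PyTorch):

```python
import torch

z = torch.tensor([3.0, 25.0], requires_grad=True)
z.clamp(max=10.0).sum().backward()   # 25.0 gets cut to 10.0
print(z.grad)                        # tensor([1., 0.]) -- no gradient past the cap
```

The sigmoid, by contrast, squashes smoothly, so every sample keeps a nonzero gradient.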
As I stated in the very beginning, the only way out is to constrain the linear values, which is exactly what logistic regression does with the sigmoid function. I hope this helps.