W2 | Logistic Vs Linear Regression | Would you expect a bad model and why?

But a value of 10 would be enough to move the gradient; that is the point. Or a value that is a ratio of some other value.

I find half of the comments about your pride in having said X about Y a few comments earlier quite annoying. It’s good that you know the answers, maybe, but you certainly didn’t get them on the first try.

Apologies for that, I was just trying to make this more of an interactive discussion. I saw one thread like this and wanted to try this out myself, but perhaps I should stick to the more formal Q/A format :joy:

Nonetheless, I have invited another mentor to this discussion. He will soon try and answer your query. I hope that helps.

Cheers,
Elemento

As a rule of thumb, you only need it if the reader is skipping what you say. Not my case; your comments are appreciated.

And so are yours :blush:

Cheers,
Elemento

Hey @Mah_Neh,
Can you please check this thread out once? I hope this helps.

P.S. - Thanks a lot @anon57530071 for the suggestion :blush:

Cheers,
Elemento

Hi, @Mah_Neh and Elemento. I followed your thread and decided to weigh in from a probability and statistics perspective. @Mah_Neh, if you have a statistics background, the following may shed some light on the issue at hand. If not, it won’t overtax the server’s storage capacity!

Consider the following linear regression model:

\log\left(\frac{y}{1-y}\right) = w^T X + b + \epsilon \hspace{5em}(*)

There are a few nuances here.

  1. y \in (0, 1) is a continuous random variable. That means that it can be interpreted as a probability.
  2. Given (1), the dependent variable is the log of the odds-ratio. E.g. if y=0.25, then the odds-ratio is \tfrac{0.25}{0.75}=\tfrac{1}{3}, i.e. the odds against the event occurring (like your horse placing first) are 3-to-1 (see the worked example after this list).
  3. The odds-ratio, \tfrac{y}{1-y}, is also continuous and contained in (0,\infty), i.e. it is a positive real number.
  4. The log of the odds-ratio, \log\left(\frac{y}{1-y}\right) – the dependent variable – is any real number, i.e. contained in (-\infty, \infty).
  5. The linear regression model (*) typically assumes that, \epsilon is a normally-distributed random variable with mean 0 and variance \sigma^2. The dependent variable thus inherits that distribution.
  6. An aside: Point (5) implies that the odds-ratio is log-normally distributed.
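
To make points (2) through (4) concrete, here is the arithmetic for two values of y (numbers of my own choosing, purely for illustration):

y = 0.25 \implies \frac{y}{1-y} = \frac{0.25}{0.75} = \frac{1}{3} \implies \log\left(\frac{y}{1-y}\right) \approx -1.10

y = 0.75 \implies \frac{y}{1-y} = \frac{0.75}{0.25} = 3 \implies \log\left(\frac{y}{1-y}\right) \approx +1.10

Symmetric probabilities map to log-odds of equal magnitude and opposite sign, and y = 0.5 maps to exactly 0.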

All of this implies that (*) is a straight-up linear (normal) regression model; note that the log-odds-ratio is not a categorical variable (either 0 or 1). It is not labelled data.

As an example of where this model might be appropriate, suppose we are marine biologists interested in the presence of a particular fish parasite in various lakes throughout a region. The features of X are properties of a particular lake (water temperature, alkalinity, etc.). Each lake in the region is an example in the dataset that we compile. We sample lots of fish in each lake and come up with the proportion of affected fish for each lake. That proportion, with careful sampling procedures, may be interpreted as a probability (y).

Run the linear regression (*) on our dataset and then predict the affected proportion of fish (the probability that a fish is infected) in a lake where the fish have not been sampled, but where we know the lake’s feature values.
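
Here is a minimal sketch of that procedure in Python, using made-up lake numbers purely for illustration (the variable names and data are mine, not anything from the course):

```python
import numpy as np

# Made-up data: each row of X holds one lake's features
# (say, water temperature and alkalinity); y holds the observed
# proportion of infected fish in that lake, strictly inside (0, 1).
X = np.array([[14.2, 6.9],
              [18.5, 7.4],
              [21.0, 8.1],
              [16.3, 7.0],
              [19.8, 7.8]])
y = np.array([0.12, 0.35, 0.61, 0.22, 0.48])

# Dependent variable of model (*): the log of the odds-ratio.
log_odds = np.log(y / (1 - y))

# Ordinary least squares on the transformed target; a column of
# ones is appended so the intercept b is estimated along with w.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, log_odds, rcond=None)
w, b = coef[:-1], coef[-1]

# Predicted log-odds for a lake we have not sampled.
x_new = np.array([17.0, 7.2])
z_new = x_new @ w + b
print("predicted log-odds:", z_new)
```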

The expected log-odds-ratio, obtained by setting \epsilon to zero, is

\log\left(\frac{y}{1-y}\right) = w^T X + b \hspace{5em}(**)

Now solve (**) for the probability y. Let z = w^T X + b, if you wish:

\log\left(\frac{y}{1-y}\right) = z \iff \frac{y}{1-y} = e^z \iff \frac{1-y}{y} = e^{-z}
\iff \frac{1}{y} - 1 = e^{-z} \iff \frac{1}{y} = 1 + e^{-z} .

And finally,

y = \frac{1}{1 + e^{-z}} .

Look familiar? The predicted probability has the form of the logistic (sigmoid) function of z. Here, regression on the log-odds-ratio is an entirely standard application of linear regression, and the “best” estimator (i.e. the maximum likelihood estimator, or “MLE”) of the parameters is the one that minimizes the mean squared error (MSE).
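
Continuing the little Python sketch from the fish example (same made-up variables), the predicted log-odds for the unsampled lake maps back to a probability through exactly this formula:

```python
# Map the predicted log-odds back to a probability: the sigmoid.
p_new = 1.0 / (1.0 + np.exp(-z_new))
print("predicted probability of infection:", p_new)
```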

Alternatively, using this same non-categorical dataset (no labels), one could propose the truncated normal linear regression model

y = w^TX + b + \epsilon

where both tails of the normal distribution for the error term \epsilon are “chopped off” so that y cannot fall below zero or rise above one. Now we are into whole new territory. The “best” estimate of the parameters is not the one that minimizes MSE. And it doesn’t seem very appealing on intuitive grounds, does it?
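
To make the “whole new territory” remark a bit more tangible, here is a rough sketch of what the MLE for that truncated model would involve, reusing X and y from the earlier made-up lake data and a parameterisation of my own choosing (an illustration, not a recommended procedure):

```python
import numpy as np
from scipy.stats import truncnorm
from scipy.optimize import minimize

def neg_log_likelihood(params, X, y):
    # params = (w_1, ..., w_d, b, log_sigma)
    *w, b, log_sigma = params
    mu = X @ np.asarray(w) + b
    sigma = np.exp(log_sigma)
    # Truncate the normal at 0 and 1 so that y cannot leave (0, 1).
    a_std = (0.0 - mu) / sigma
    b_std = (1.0 - mu) / sigma
    return -np.sum(truncnorm.logpdf(y, a_std, b_std, loc=mu, scale=sigma))

init = np.zeros(X.shape[1] + 2)   # start at w = 0, b = 0, log_sigma = 0
fit = minimize(neg_log_likelihood, init, args=(X, y), method="Nelder-Mead")
w_mle, b_mle = fit.x[:-2], fit.x[-2]
print("truncated-normal MLE:", w_mle, b_mle)
```

Note that the estimate comes from numerically maximizing the truncated-normal likelihood; it is no longer the least-squares (minimum-MSE) solution.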

There are other distributions that could work, chiefly the beta distribution, but again, we would have to struggle to find the “best” estimator and it will not minimize MSE.

Furthermore, when we go beyond binary classification, this type of procedure is not tractable.

@Elemento pointed you to a good source describing why linear regression on categorical data is not such a hot idea (not horrible, though, if your model uses some explicit truncation and you derive the MLE). In the world of categorical data, that leaves us with the binomial distribution as the natural probability model for binary classification of labelled data, for which cross-entropy loss turns out to be the “best” loss function, and the multinomial distribution for higher-dimensional classification problems.
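
For completeness, here is a tiny sketch of that last point: with 0/1 labels and a Bernoulli model (a binomial with a single trial), the negative log-likelihood is exactly the binary cross-entropy loss. The arrays are hypothetical, just to show the identity:

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of a Bernoulli model, i.e. binary cross-entropy."""
    p = np.clip(p_pred, eps, 1 - eps)   # keep the logs finite
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical labels and predicted probabilities.
labels = np.array([0, 1, 1, 0, 1])
probs = np.array([0.1, 0.8, 0.65, 0.3, 0.9])
print(cross_entropy(labels, probs))
```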

If this obfuscates more than enlightens, feel free to purge it from your “wet” memory and storage!
:smile:

I’ve had enough for today, but I normally log in here daily, so I’ll be replying tomorrow. The more in-depth, the better. Thanks for your reply.