Hi, @Mah_Neh and Elemento. I followed your thread and decided to weigh in from a probability and statistics perspective. @Mah_Neh, if you have a statistics background, the following may shed some light on the issue at hand. If not, it won’t overtax the server’s storage capacity!
Consider the following linear regression model:
\log\left(\frac{y}{1-y}\right) = w^T X + b + \epsilon \hspace{5em}(*)
There are a few nuances here.
1. y \in (0, 1) is a continuous random variable. That means that it can be interpreted as a probability.
2. Given (1), the dependent variable is the log of the odds ratio. E.g. if y=0.25, then the odds ratio is \tfrac{0.25}{0.75} = \tfrac{1}{3}; equivalently, the odds against the event occurring (like your horse placing first) are 3 (i.e. "3-to-1 against"). See the quick check after this list.
3. The odds ratio, \tfrac{y}{1-y}, is also continuous and contained in (0,\infty), i.e. it is a positive real number.
4. The log of the odds ratio, \log\left(\frac{y}{1-y}\right), which is the dependent variable, can be any real number, i.e. it is contained in (-\infty, \infty).
5. The linear regression model (*) typically assumes that \epsilon is a normally-distributed random variable with mean 0 and variance \sigma^2. The dependent variable thus inherits that distribution.
6. An aside: Point (5) implies that the odds ratio is log-normally distributed.
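A quick numeric check of points (2)-(4), just the arithmetic from the y=0.25 example above:

```python
import numpy as np

y = 0.25                     # probability of the event
odds = y / (1 - y)           # odds ratio: 1/3 (odds of 3-to-1 against)
log_odds = np.log(odds)      # negative, since y < 0.5

print(odds, log_odds)        # 0.3333... -1.0986...

# Any y in (0, 1) maps to a log-odds value anywhere on the real line:
for p in (0.001, 0.25, 0.5, 0.999):
    print(p, np.log(p / (1 - p)))
```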
All of this implies that (*) is a straight-up linear (normal) regression model. Note that the log-odds ratio is not a categorical variable (either 0 or 1); this is not labelled data.
As an example of where this model might be appropriate, suppose we are marine biologists interested in the presence of a particular fish parasite in various lakes throughout a region. The features of X are properties of a particular lake (water temperature, alkalinity, etc.). Each of the lakes in the region is an example in the dataset that we compile. We sample lots of fish in each lake and come up with the proportion of affected fish for each lake. That proportion, with careful sampling procedures, may be interpreted as a probability (y).
Run the linear regression (*) on our dataset and then predict the affected proportion of fish (the probability that a fish is infected) in a lake where the fish have not been sampled, but whose feature values we know.
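Here is a minimal sketch of that workflow (the lake data is synthetic so the code is self-contained; in practice X and y would come from your field measurements):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic dataset: one row per lake, columns = lake features
# (e.g. water temperature, alkalinity); y = proportion of infected fish.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                    # 50 lakes, 2 features
true_w, true_b = np.array([1.5, -0.8]), 0.2
log_odds = X @ true_w + true_b + rng.normal(scale=0.3, size=50)
y = 1 / (1 + np.exp(-log_odds))                 # proportions in (0, 1)

# Transform the proportions to log-odds and run ordinary linear regression.
model = LinearRegression().fit(X, np.log(y / (1 - y)))

# Predict the infected proportion for an unsampled lake from its features.
x_new = np.array([[0.5, 1.0]])
z = model.predict(x_new)                        # predicted log-odds
p_hat = 1 / (1 + np.exp(-z))                    # back to a probability
print(p_hat)
```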
The expected log-odds ratio, obtained by setting \epsilon to its mean of zero, is
\log\left(\frac{y}{1-y}\right) = w^T X + b \hspace{5em}(**)
Now solve (**) for the probability, y. Let z = w^T X + b, if you wish:
\log\left(\frac{y}{1-y}\right) = z \iff \frac{y}{1-y} = e^z \iff \frac{1-y}{y} = e^{-z} \iff \frac{1}{y} = 1 + e^{-z} .
And finally,
y = \frac{1}{1 + e^{-z}} .
Look familiar? The predicted probability is the logistic (sigmoid) function of z. Here, linear regression on the log-odds ratio is an entirely standard application of linear regression, and the "best" estimator (i.e. the maximum likelihood estimator, or "MLE") of the parameters is the one that minimizes the mean squared error (MSE).
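A two-line numeric check of that algebra, if you like:

```python
import numpy as np

y = 0.73
z = np.log(y / (1 - y))       # logit: the forward transform
print(1 / (1 + np.exp(-z)))   # sigmoid: recovers 0.73 exactly
```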
Alternatively, using this same non-categorical dataset (no labels), one could propose the truncated normal linear regression model
y = w^TX + b + \epsilon
where both tails of the normal distribution for the error term \epsilon are "chopped off", so that y cannot fall below zero or above one. Now we are into whole new territory. The "best" estimate of the parameters is not the one that minimizes MSE. And it doesn't seem very appealing on intuitive grounds, does it?
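To see why, here is a sketch of the objective one would actually have to maximize under that model (the function name and the fixed \sigma are mine, purely for illustration):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def truncated_normal_nll(params, X, y, sigma=0.1):
    """Negative log-likelihood for y ~ Normal(Xw + b, sigma) truncated to (0, 1)."""
    w, b = params[:-1], params[-1]
    mu = X @ w + b
    log_pdf = norm.logpdf(y, loc=mu, scale=sigma)
    # Mass the untruncated normal assigns to (0, 1); it depends on mu,
    # so this objective is NOT a monotone function of the squared error.
    log_mass = np.log(norm.cdf((1 - mu) / sigma) - norm.cdf(-mu / sigma))
    return -(log_pdf - log_mass).sum()

# Hypothetical usage, given lake features X and proportions y:
# result = minimize(truncated_normal_nll, x0=np.zeros(X.shape[1] + 1), args=(X, y))
```

The normalization term (log_mass) depends on the parameters through \mu, which is exactly what breaks the "MLE = least squares" equivalence we had above.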
There are other distributions that could work, chiefly the beta distribution, but again, we would have to struggle to find the "best" estimator, and it would not minimize MSE.
Furthermore, when we go beyond binary classification, this type of procedure is not tractable.
@Elemento pointed you to a good source describing why linear regression on categorical data is not such a hot idea (though it's not horrible if your model uses some explicit truncation and you derive the MLE). In the world of categorical (labelled) data, that leaves us with the binomial distribution as the natural probability model for binary classification, for which cross-entropy turns out to be the "best" loss function, and the multinomial distribution for higher-dimensional classification problems.
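To make that last claim concrete: with a label t \in \{0,1\} and a predicted probability p, the Bernoulli likelihood of the label is p^t(1-p)^{1-t}, so its negative log is exactly the binary cross-entropy loss. A quick sketch:

```python
import numpy as np

def binary_cross_entropy(t, p):
    """Negative Bernoulli log-likelihood, averaged over examples."""
    return -(t * np.log(p) + (1 - t) * np.log(1 - p)).mean()

t = np.array([1, 0, 1, 1])           # binary labels
p = np.array([0.9, 0.2, 0.7, 0.6])   # predicted probabilities
print(binary_cross_entropy(t, p))
```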
If this obfuscates more than enlightens, feel free to purge it from your “wet” memory and storage!
