There is one thing I don’t understand at all in the anomaly detection section. We’re referring to the normal distribution, okay. But there seems to be a confusion between the density function and the probability under that distribution. For me, the formula given for p(x) is NOT a probability but a probability density function.
However, p(x) is called a “probability” several times. This remains the case when we have several features and multiply these so-called “probabilities” (assuming independence).
Hey @Pierre_BEJIAN, I guess the confusion lies between treating p(x) as a function and treating p(x) as the probability value corresponding to x. And this confusion is genuine, since both of them use the same notation.
When Prof Andrew refers to this as the probability, he is referring to the latter, i.e., when you feed x to this function, it returns p(x) as the corresponding probability value. For example, when x = 0, p(0) denotes the value of the probability, and not the probability density function. Similarly, when we multiply the probabilities, they are the probability values corresponding to the different elements of a single x, since x now has multiple dimensions. Let me know if this helps.
P.S. - Please ignore this answer since I was a bit confused between probability density and probability for continuous spaces. Apologies for the confusion!
I think Pierre’s point is well grounded, and I understood Andrew’s explanation as “probability (strictly speaking probability density function, but if you don’t know the difference don’t worry about it)”.
A simple argument against treating p(x) as a probability is to consider a small standard deviation, \sigma < \frac{1}{\sqrt{2\pi}}.
The maximum of the density, p(\mu) = \frac{1}{\sigma\sqrt{2\pi}}, is then greater than 1, which is impossible for a probability.
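The argument above is easy to check numerically. The following is a minimal sketch using only the Python standard library; the helper name `gaussian_pdf` and the chosen values of sigma are illustrative assumptions, not something taken from the course material.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# The peak of the density exceeds 1 whenever sigma is below this value.
threshold = 1.0 / math.sqrt(2.0 * math.pi)  # about 0.399

# The density is largest at x = mu, so evaluate it there.
peak_small = gaussian_pdf(0.0, mu=0.0, sigma=0.1)  # sigma below the threshold
peak_large = gaussian_pdf(0.0, mu=0.0, sigma=1.0)  # sigma above the threshold

print(f"sigma threshold: {threshold:.4f}")
print(f"peak density with sigma=0.1: {peak_small:.4f}")
print(f"peak density with sigma=1.0: {peak_large:.4f}")
```

With sigma = 0.1 the peak density is close to 4, so the value returned by the formula cannot be a probability.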
From another point of view, if x is continuous, the probability of any particular value is zero. To get probabilities larger than zero, we have to consider integrating the pdf along an interval.
Exactly as @Jaume_Oliver_Lafont said, the formula you show is the probability density function. You may disagree with what I am going to say, but it’s not disastrous to use probability and probability density interchangeably here. My reason is that, between integrating p over a small grid cell around x and just evaluating p(x), I think both are good measures of the chance of seeing a sample at point x: for a fixed grid size, the higher the probability density, the higher the probability obtained by integrating.
However, if the video refers to the formula as a probability function, then I agree that the only right name for it is the probability density function. Still, I think it’s not disastrous, and p(x) itself is a good proxy for probability.
I think an important thing to remember is that the y-value of a point on the Gaussian curve for a given x-value is the ‘probability density value’. You have to think of probability as the area under the curve. The probability of a specific (isolated) value for continuous data is 0. The easy way to get the probability that a value is lower than or equal to a specific value is to use the Cumulative Distribution Function (CDF). The y-axis of the CDF gives you that probability directly, so you avoid computing the integral over the specified interval.
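As a rough illustration of reading probabilities off the CDF, here is a small sketch for a Gaussian. It assumes the standard closed form of the normal CDF in terms of the error function; the helper names `gaussian_cdf` and `interval_probability` are made up for this example.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_probability(a, b, mu, sigma):
    """P(a < X <= b), obtained as a difference of two CDF values."""
    return gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)

# Probability of falling within one standard deviation of the mean.
p_within_one_sigma = interval_probability(-1.0, 1.0, mu=0.0, sigma=1.0)

# An "interval" of zero width, i.e. a single isolated value.
p_single_point = interval_probability(0.5, 0.5, mu=0.0, sigma=1.0)

# Probability of a value lower than or equal to the mean.
p_below_mean = gaussian_cdf(0.0, mu=0.0, sigma=1.0)

print(p_within_one_sigma, p_single_point, p_below_mean)
```

The zero-width interval gives exactly 0, while the one-sigma interval gives roughly 0.68, which matches the "area under the curve" picture.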
You are right. I think Andrew tried to avoid complex probability theory for this course and therefore left some gaps.
The programming exercise uses the right calculation “under the hood”: check the utils library imported at the beginning of the code, and more specifically the multivariate Gaussian function.
The correct probability comes from numerically integrating the probability density function, the Gaussian p(x), from -\infty to x_i: AN INTERVAL.
And yes, at a single point the probability is zero. But that doesn’t mean the value is impossible! That is for another topic, though.
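To make the idea of integrating the density up to x_i concrete, here is a hedged sketch using a plain trapezoidal rule. It is not taken from the course's utils code; the function names, the step count, and the choice to truncate the lower limit at ten standard deviations below the mean are all assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate_density(upper, mu, sigma, steps=20000):
    """Trapezoidal estimate of P(X <= upper).

    Assumption: the lower limit of -infinity is replaced by mu - 10*sigma,
    below which the Gaussian density is negligible.
    """
    lower = mu - 10.0 * sigma
    if upper <= lower:
        return 0.0
    h = (upper - lower) / steps
    total = 0.5 * (gaussian_pdf(lower, mu, sigma) + gaussian_pdf(upper, mu, sigma))
    for i in range(1, steps):
        total += gaussian_pdf(lower + i * h, mu, sigma)
    return total * h

numeric = integrate_density(1.0, mu=0.0, sigma=1.0)

# Closed-form value of the same probability, for comparison.
closed_form = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))

print(numeric, closed_form)
```

Both numbers come out near 0.841, so the numerical integral and the closed-form CDF agree on the probability of the interval.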
What makes the terminology in the videos ambiguous is that, when we’re talking about probability, we must state the event very clearly.
In this case, we may talk about the probability of the event “x is anomalous”, not the probability of “x”, because x is an example in the data set. But that is another problem.
The idea of anomaly detection, as Andrew said, is based on density estimation, so there is no probability here. I prefer to use another letter, such as f, to denote the probability density function (or density for short) instead of p; f(x) is then read as the density at x.
If I understand correctly, you are referring to the area of the small part obtained by integrating over an interval containing x. Yes, I think your idea is very interesting, because this value can be used to detect if x is an anomaly as well.
The downside is that we have to compute an integral for every x. Instead, we can fix an interval around the mean, where most examples concentrate, and compute the probability that x belongs to this interval. But this probability is the same for all x.
Now imagine we move 2 vertical segments around x towards each other until they meet. The length of this segment is exactly the density at x.
For an experienced person, the interpretation will be fine, but I think it is better to use the right notation and to make the distinction (especially for two notions that are often confused and yet totally different). For people who will continue with the Bayesian approach, it is important to distinguish probability from likelihood.
As I go through the discussion here again, I think everybody agrees that:
probability density is different from probability;
the formula p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} is the formula for the probability density;
we get the probability by integrating p(x) over a region of x, be the range small or large.
Some of us may agree that this specialization is not for teaching probability, but it cannot stop us from applying it.
Approach 1: I think that the probability density is a good PROXY for probability: P(x') := P( x' - \frac{\delta}{2} < x < x' + \frac{\delta}{2}) \approx p(x')\delta \propto p(x') for a very small fixed range \delta around a point x',
so that the higher the probability density, the higher the probability within a region of that fixed size.
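The claim that the density times a small width approximates the probability of that small interval can be checked with a short sketch. The width `delta`, the test points, and the helper names are assumptions chosen for illustration, and a standard normal is used for simplicity.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

delta = 1e-3               # small fixed interval width (assumed value)
points = [0.0, 1.0, 2.5]   # illustrative points, ordered by distance from the mean

# Probability of the interval of width delta centred on each point.
exact = [gaussian_cdf(x + delta / 2) - gaussian_cdf(x - delta / 2) for x in points]

# Proxy: density at the point multiplied by the interval width.
approx = [gaussian_pdf(x) * delta for x in points]

for x, e, a in zip(points, exact, approx):
    print(f"x={x}: interval probability={e:.6e}, density*delta={a:.6e}")
```

For a fixed `delta`, ranking points by density gives the same order as ranking them by the probability of their small interval, which is the sense in which the density acts as a proxy.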
Approach 2: However, some of us may want to pose the check differently, as P(\text{event of observing a value as extreme as }x') = P( -\infty < x < x') when x' is negative, or P(\text{event of observing a value as extreme as }x') = P( x' < x < \infty) when x' is positive.
I believe we will agree that both can work the same way in practice, given the right \epsilon value, adjusted for whichever approach is adopted. Practically speaking, approach 1 would definitely be preferred because it is computationally cheaper, requiring no integration. This is why I support using the probability density function directly for anomaly detection, as the lecture and the assignment do.
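Here is a minimal sketch of why the two checks can flag the same points once \epsilon is adjusted. It covers only the single-feature Gaussian case with a zero mean and unit standard deviation; the sample values, the density threshold, and the helper names are assumptions made for illustration.

```python
import math

MU, SIGMA = 0.0, 1.0  # assumed parameters of the fitted Gaussian

def density(x):
    """Density of N(MU, SIGMA^2) at x (approach 1)."""
    z = (x - MU) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))

def tail_probability(x):
    """Probability of a value at least as far from MU as x, on x's side (approach 2)."""
    z = abs(x - MU) / SIGMA
    return 0.5 * math.erfc(z / math.sqrt(2.0))

eps_density = 0.02  # illustrative threshold on the density

# Distance from the mean at which the density equals eps_density.
cutoff = SIGMA * math.sqrt(-2.0 * math.log(eps_density * SIGMA * math.sqrt(2.0 * math.pi)))

# Matching threshold for the tail probability at that same distance.
eps_tail = tail_probability(MU + cutoff)

samples = [-3.1, -0.4, 0.0, 1.2, 2.0, 2.6, 4.0]
flags_density = [density(x) < eps_density for x in samples]
flags_tail = [tail_probability(x) < eps_tail for x in samples]

print(flags_density)
print(flags_tail)
```

Both lists flag the same samples because, for one Gaussian feature, the density and the tail probability both decrease as the distance from the mean grows. The density check gets there without any integration.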
Approach 2 above is a fairly frequentist approach. If we wanted to adopt the Bayesian approach and construct the likelihood, I guess we would need some anomalous samples, which we assumed we didn’t have in the training set for this course because we are discussing it in the context of unsupervised learning.
Thanks @Jun_Wu, my message is not about the dimensions considered, so it is up to the audience to read it as is, or to extend it to the multi-dimensional case. And welcome to our community, Jun.