C3_W1 Why use the Gaussian distribution

Hi,

In the lectures, Professor Ng uses the Gaussian distribution to perform anomaly detection. I was wondering why the Gaussian distribution is used instead of other distributions.

Hi @Andromeda18 !
We have to remember that this is one approach to the problem at hand, and there are others. In this approach, we use the statistical behavior of a normal distribution to detect anomalies. A normal distribution has convenient properties for this: it is fully described by its mean and standard deviation, and the probability of a value drops off quickly as it moves away from the mean. I recommend studying the normal distribution a little more; I think everything will become clearer.
To simplify: if an example lies more than three standard deviations from the mean of a normal distribution, its probability under that distribution is very small (only about 0.3% of values fall that far out). Do you agree?
Hope this helps!
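
To put a number on the three-standard-deviation point above, here is a minimal sketch that computes how much probability a Gaussian leaves beyond k standard deviations on both sides of the mean. The function name `two_sided_tail` is just an illustrative choice, not something from the course.

```python
import math

def two_sided_tail(k):
    """P(|X - mu| > k * sigma) for a Gaussian X, using the complementary error function."""
    return math.erfc(k / math.sqrt(2.0))

# Probability mass lying beyond 1, 2 and 3 standard deviations from the mean.
for k in (1, 2, 3):
    print(f"beyond {k} sigma: {two_sided_tail(k):.4%}")
```

This only holds if the feature really is Gaussian; for heavier-tailed data, values that far from the mean are more common.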

Hello @Andromeda18, when we build our next anomaly detection system, it’s our job to verify that the sample’s distribution matches our model assumptions. For example, in the video, we assumed the samples to be Gaussian distributed on each feature, and we assumed independence among features.
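
As a rough sketch of what those two assumptions mean in practice: each feature gets its own Gaussian, and because of the independence assumption the joint density is just the product of the per-feature densities. The function names, the toy data and the threshold `epsilon` below are all made up for illustration; in a real system the threshold would be tuned on a labelled validation set.

```python
import math

def fit_gaussians(rows):
    """Estimate a mean and variance per feature (maximum likelihood, dividing by m)."""
    m = len(rows)
    n = len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / m for j in range(n)]
    return means, variances

def density(x, means, variances):
    """p(x) as a product of 1-D Gaussian densities (this is the independence assumption)."""
    p = 1.0
    for xj, mu, var in zip(x, means, variances):
        p *= math.exp(-((xj - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
    return p

def is_anomaly(x, means, variances, epsilon):
    """Flag x when its density under the fitted model falls below the threshold."""
    return density(x, means, variances) < epsilon

# Hypothetical training data: 5 normal examples with 2 features each.
train = [[1.0, 10.0], [1.2, 9.5], [0.8, 10.5], [1.1, 10.2], [0.9, 9.8]]
means, variances = fit_gaussians(train)
epsilon = 1e-3  # hypothetical threshold

print(is_anomaly([1.0, 10.0], means, variances, epsilon))
print(is_anomaly([2.0, 10.0], means, variances, epsilon))
```

If either assumption fails badly (a clearly non-Gaussian feature, or strongly correlated features), the densities this produces can be misleading.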

As for whether a feature is Gaussian distributed: quantitatively, we can measure it with a method such as the Kolmogorov–Smirnov test; qualitatively, we can examine whether the process that generates the samples on that feature dimension is additive. For example, the environmental noise level is likely to be Gaussian because it is the sum of various noise sources, such as a car passing by, pedestrians talking on the phone or to each other, construction work, and so on. While any one of these can be non-Gaussian, their sum tends toward a Gaussian as the number of independent sources grows, according to the central limit theorem.
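
Here is a minimal sketch of the quantitative check, using `scipy.stats.kstest` on synthetic data (the two features below are generated just for illustration). One caveat: because the mean and standard deviation are estimated from the same sample being tested, the reported p-value is more lenient than it should be; the Lilliefors variant of the test corrects for that. So treat the output as a rough screen rather than a strict test.

```python
import numpy as np
from scipy import stats

def ks_against_fitted_normal(values):
    """Standardize a feature with its own mean/std, then run a KS test against N(0, 1)."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    result = stats.kstest(z, "norm")
    return result.statistic, result.pvalue

# Synthetic features: one Gaussian, one clearly skewed (exponential).
rng = np.random.default_rng(0)
gaussian_feature = rng.normal(loc=5.0, scale=2.0, size=2000)
skewed_feature = rng.exponential(scale=2.0, size=2000)

for name, feature in (("gaussian", gaussian_feature), ("skewed", skewed_feature)):
    d, p = ks_against_fitted_normal(feature)
    print(f"{name}: KS statistic = {d:.3f}, p-value = {p:.3g}")
```

A small p-value is evidence against the Gaussian assumption for that feature. A histogram or Q-Q plot is a useful visual companion to the test.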

Many processes are additive, so the Gaussian distribution is a popular choice for modeling a random variable.
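
The additive argument can be checked with a small simulation. In this sketch each "noise source" is an exponential random variable, which is strongly skewed and far from Gaussian; that choice, and the function names, are assumptions made purely for illustration. As more independent sources are summed, the skewness of the total moves toward 0, the value for a Gaussian.

```python
import math
import random

def sample_skewness(values):
    """Skewness of a sample: mean of the cubed standardized values (0 for a symmetric shape)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

def summed_noise(num_sources, num_samples, rng):
    """Each sample is the sum of num_sources independent exponential noise sources."""
    return [
        sum(rng.expovariate(1.0) for _ in range(num_sources))
        for _ in range(num_samples)
    ]

rng = random.Random(0)
for num_sources in (1, 5, 30):
    total = summed_noise(num_sources, 20000, rng)
    print(f"{num_sources} sources: skewness = {sample_skewness(total):.2f}")
```

Skewness is only one symptom of non-Gaussianity, but it makes the trend easy to see without plotting anything.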

Raymond

Hi,

@Lukas_Mendes, I definitely agree with you that examples more than 3 sigma away from the mean are far less likely to occur. My question was really about the assumption that the data is normally distributed. I realize that the normal distribution applies to many natural phenomena and I know it’s a very popular choice for modelling random variables, but personally I feel more comfortable carrying out some type of formal evaluation of the data’s distribution, like the one @rmwkwok mentioned.
