Dear Andrew,
I am a great fan of your courses and have followed many of them. Anomaly detection is the only topic so far where I felt that the method you suggest, Gaussian maximum likelihood estimation, is not optimal and that there may be a better way.
Here is my suggestion:
Why not use kernel density estimates (KDE)?
Method: We estimate the probability distribution as follows:
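P(x) = (1/m) * sum_{i=1}^{m} prod_{j=1}^{n} G(x_j; X_j^i, sigma)
where
G is the one-variable Gaussian density shown in the course, here centered at X_j^i with standard deviation sigma
x_j is the j-th feature of the point x being scored
X^i is the i-th training example and X_j^i is its j-th feature
m is the number of training examples and n is the number of features.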
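
To make the formula concrete, here is a minimal sketch of how I picture it in Python with NumPy. The function names kde_density and is_anomaly, and the choice of a single scalar sigma shared by all features, are my own assumptions for illustration; they are not from the course. The sum is evaluated in log space only to avoid numerical underflow when n is large.

```python
import numpy as np

def kde_density(x, X_train, sigma):
    """Product-Gaussian kernel density estimate P(x) at a single point.

    x       : array-like of shape (n,), the point to score
    X_train : array-like of shape (m, n), the training examples
    sigma   : scalar bandwidth shared by all features (an assumption)
    """
    x = np.asarray(x, dtype=float)
    X_train = np.asarray(X_train, dtype=float)
    n = X_train.shape[1]
    diff = x - X_train  # shape (m, n), one row per training example
    # log of prod_j G(x_j; X_j^i, sigma), one value per training example i
    log_kernels = (-0.5 * np.sum((diff / sigma) ** 2, axis=1)
                   - n * np.log(sigma * np.sqrt(2.0 * np.pi)))
    # Average over the m examples, factoring out the largest term for stability
    max_log = np.max(log_kernels)
    return np.exp(max_log) * np.mean(np.exp(log_kernels - max_log))

def is_anomaly(x, X_train, sigma, epsilon):
    """Flag x as anomalous when its estimated density falls below epsilon."""
    return kde_density(x, X_train, sigma) < epsilon
```

In this sketch, epsilon plays the same role as in the course: a point is flagged when P(x) < epsilon, and both sigma and epsilon would be chosen on a cross-validation set.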
Just as epsilon controls the trade-off between precision and recall, the parameter sigma in my suggestion controls it too: high sigma → low precision, high recall; low sigma → high precision, low recall.
Although KDE is not a proper statistical inference tool, we are not after the exact distribution of the training data set here.
The pros of my suggestion:
- It can handle correlated features.
- There is no need to transform non-Gaussian features to make them Gaussian. That process is already very cumbersome when there are, for example, more than 20 features.
- The course method cannot handle a mixed (multimodal) distribution, a double bell for example, while my suggestion can.
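
The double-bell point can be illustrated with a small sketch, again in Python with NumPy. The training data below is made up for illustration (two evenly spaced clusters around -5 and +5), and the bandwidth of 0.5 is an arbitrary choice of mine. The sketch compares a single fitted Gaussian with the kernel density estimate at the empty midpoint 0 and at the cluster center 5.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """One-variable Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Made-up bimodal 1-D training set: two clusters, nothing in between
X_train = np.concatenate([np.linspace(-6, -4, 50), np.linspace(4, 6, 50)])

# Course-style fit: a single Gaussian with the sample mean and standard deviation
mu_hat = X_train.mean()
sigma_hat = X_train.std()

def single_gaussian_density(x):
    return gaussian_pdf(x, mu_hat, sigma_hat)

def kde_density_1d(x, bandwidth=0.5):
    # Average of one Gaussian bump per training example (bandwidth is assumed)
    return gaussian_pdf(x, X_train, bandwidth).mean()

for point in (0.0, 5.0):
    print(point, single_gaussian_density(point), kde_density_1d(point))
```

On this toy data the single Gaussian is centered between the two clusters, so it assigns its highest density to the empty midpoint, whereas the kernel estimate is close to zero there and much larger inside either cluster.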
I would be very grateful to receive your feedback, Andrew; it would help me gain a deeper understanding of anomaly detection.
Kind regards,
Shankha.