Categorical variables in anomaly detection

rmwkwok · September 21, 2022, 12:52am

Gaussian is a handy distribution as I explained here, but as you said -

I think we should always look at this from the opposite direction - it’s not that we have to make everything Gaussian, but it is our responsibility to make sure we are applying the correct statistical model assumption to the data in question. For example, we don’t ask how we can make any data Gaussian; instead, we ask whether our data is Gaussian.
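One rough way to ask "is this feature Gaussian?" in code is to look at its sample skewness and excess kurtosis, which are both close to 0 for Gaussian data. This is only a sketch using the Python standard library; the function name and the two synthetic samples are made up for illustration, and a histogram or a formal normality test would be a reasonable alternative.

```python
import random
import statistics

def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis); both are near 0 for Gaussian data."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (1/m)
    z = [(v - mean) / std for v in values]
    skew = statistics.fmean([zi ** 3 for zi in z])
    excess_kurt = statistics.fmean([zi ** 4 for zi in z]) - 3.0
    return skew, excess_kurt

# Synthetic data, for illustration only.
rng = random.Random(0)
gaussian_like = [rng.gauss(10.0, 2.0) for _ in range(5000)]
skewed = [rng.expovariate(1.0) for _ in range(5000)]

g_skew, g_kurt = skew_and_excess_kurtosis(gaussian_like)
s_skew, s_kurt = skew_and_excess_kurtosis(skewed)
print(f"gaussian-like: skew={g_skew:.2f}, excess kurtosis={g_kurt:.2f}")
print(f"skewed:        skew={s_skew:.2f}, excess kurtosis={s_kurt:.2f}")
```

Values far from 0 suggest the Gaussian assumption is a poor fit for that feature. How far is "too far" is a judgment call that depends on your data.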

For your example of is_holiday, we ask whether it is Gaussian distributed. The answer is obviously no, because it is a binary (yes/no) variable, whereas the Gaussian is a continuous distribution. In this case, we think about which statistical model assumption better or best describes the data in hand. I would consider a binomial distribution (for a single yes/no value, that is its one-trial special case, the Bernoulli distribution).
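For a yes/no feature like is_holiday, the one-trial case of the binomial (the Bernoulli distribution) has a single parameter p, which can be estimated as the fraction of examples where the feature is 1. A minimal sketch, assuming made-up data and hypothetical function names:

```python
def fit_bernoulli(values):
    """Estimate p = P(x = 1) as the fraction of ones in the data."""
    return sum(values) / len(values)

def bernoulli_pmf(x, p):
    """Probability of observing x (0 or 1) under a Bernoulli(p) model."""
    return p if x == 1 else 1.0 - p

# Made-up training data: 10 holidays out of 365 days.
is_holiday = [0] * 355 + [1] * 10

p_holiday = fit_bernoulli(is_holiday)
print(f"P(is_holiday = 1) = {p_holiday:.4f}")
print(f"P(is_holiday = 0) = {bernoulli_pmf(0, p_holiday):.4f}")
```

This plays the same role for is_holiday that estimating the mean and variance plays for a Gaussian feature.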

Note that this course won’t cover all distributions, so you will have to look up the right distribution yourself, for example by googling or going to a library.

The anomaly detection algorithm in the course makes two assumptions: the data are Gaussian distributed, and the features are independent of each other. So it’s our job to verify that both hold for our data. The first assumption tells us the form of the statistical model - but it doesn’t have to be Gaussian. The second assumption tells us that we can multiply together all the features’ probability density functions - if the features are not independent, we can’t.

The first assumption is relatively easy to deal with - if the data are not Gaussian, we only need to find a distribution that fits better. The second one is more difficult to deal with, because we need to think about how to handle the dependence between features, unless we tolerate the inaccuracy brought by the wrong assumption. Sometimes we need to accept some inaccuracy because assumptions can’t always be perfectly met - it’s a balance, and it’s a decision you need to make as the person who models the data.
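To make the "multiply the features together" step concrete, here is a minimal sketch that models one continuous feature as Gaussian and is_holiday as Bernoulli, then multiplies the two under the independence assumption. The feature name `traffic`, the data, and the threshold `epsilon` are all made up for illustration; in practice the threshold would be chosen on a validation set.

```python
import math
import statistics

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    coef = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def joint_score(traffic_value, holiday_value, mu, sigma, p_holiday):
    """p(x) as a product of per-feature terms; valid only if features are independent."""
    p_traffic = gaussian_pdf(traffic_value, mu, sigma)
    p_flag = p_holiday if holiday_value == 1 else 1.0 - p_holiday
    return p_traffic * p_flag

# Made-up training data (hypothetical features).
traffic = [100, 102, 98, 101, 99, 97, 103, 100]
is_holiday = [0, 0, 0, 0, 0, 0, 1, 0]

mu = statistics.fmean(traffic)
sigma = statistics.pstdev(traffic)          # 1/m variance estimate
p_holiday = sum(is_holiday) / len(is_holiday)

epsilon = 1e-3  # hypothetical threshold
score_typical = joint_score(100, 0, mu, sigma, p_holiday)
score_odd = joint_score(130, 1, mu, sigma, p_holiday)
print("typical example anomalous?", score_typical < epsilon)
print("odd example anomalous?    ", score_odd < epsilon)
```

If traffic actually depends on is_holiday, this product is the "wrong assumption" case described above, and one option would be to fit separate traffic parameters for holidays and non-holidays.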

I am also going to answer your other post here. First, I can’t tell you which data to screen out just from the distribution, and I can’t justify calling any data point anomalous. I don’t know the scope of your project; I don’t know the meaning of that feature; I don’t know the meaning of that feature value; I don’t know about the data collection process. Second, we ask ourselves how we should model the data - should we use Gaussian or not?

Cheers,
Raymond
