Hello @Maxim_Kupfer,
Gaussian is a handy distribution as I explained here, but as you said -
I think we should always look at this from the opposite direction: it’s not that we have to make everything Gaussian, but it is our responsibility to make sure we apply a correct statistical model assumption to the data in question. For example, we don’t ask how we can make any data Gaussian; instead, we ask whether our data is Gaussian.
For your example of is_holiday, we ask whether it is Gaussian distributed. The answer is obviously no, because it is not a continuous variable, whereas the Gaussian is a continuous distribution. In this case, we think about which statistical model assumption best describes the data in hand. I would consider a binomial distribution (for a single 0/1 value per example, that is the one-trial case, also known as the Bernoulli distribution).
Note that this course won’t cover all distributions, so you will have to look up the right distribution yourself, for example by googling or going to a library.
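To make the is_holiday case concrete, here is a minimal sketch of fitting a one-trial binomial (Bernoulli) model to a 0/1 feature. The data values and the helper names `fit_bernoulli` and `bernoulli_pmf` are made up for illustration; they are not from the course.

```python
def fit_bernoulli(values):
    """Estimate p = P(value == 1) as the fraction of ones in the data."""
    values = list(values)
    if not values:
        raise ValueError("need at least one observation")
    return sum(values) / len(values)


def bernoulli_pmf(x, p):
    """Probability of observing x (0 or 1) under a Bernoulli(p) model."""
    return p if x == 1 else 1.0 - p


# Hypothetical is_holiday column: 2 holidays in 10 days.
is_holiday = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
p_holiday = fit_bernoulli(is_holiday)

print(p_holiday)                    # 0.2
print(bernoulli_pmf(1, p_holiday))  # 0.2
```

The point is only that a discrete feature gets a probability mass function in place of the Gaussian density; which distribution is appropriate still depends on your data.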
The anomaly detection algorithm in the course makes two assumptions: the data is Gaussian distributed, and the features are independent of each other. So it’s our job to verify that both hold. The first assumption fixes the form of the statistical model, but that form doesn’t have to be Gaussian. The second assumption is what allows us to multiply all the features’ probability density functions together; without it, we can’t. The first assumption is relatively easy to deal with: if the data is not Gaussian, we only need to find a distribution that fits better. The second is harder, because we need to think about how to handle the dependent features, unless we tolerate the inaccuracy introduced by the wrong assumption. Sometimes we have to accept some inaccuracy, because assumptions can’t always be perfectly met. It’s a balance, and it’s a decision you need to make as the person who models the data.
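As a rough sketch of how the two assumptions fit together, the snippet below models one continuous feature as Gaussian and is_holiday as a 0/1 (Bernoulli) feature, then multiplies the two probabilities, which is only valid if the features are independent. The feature `temps`, the data values, and the threshold `epsilon` are all assumptions for illustration, not something taken from your project or from the course.

```python
from statistics import NormalDist, fmean, pstdev


def fit(temps, holidays):
    """Fit one model per feature: Gaussian for temps, Bernoulli for holidays."""
    # pstdev divides by m (population form) rather than m - 1.
    gaussian = NormalDist(mu=fmean(temps), sigma=pstdev(temps))
    p_holiday = sum(holidays) / len(holidays)
    return gaussian, p_holiday


def joint_density(temp, holiday, gaussian, p_holiday):
    """p(temp, holiday) = p(temp) * p(holiday); valid only under independence."""
    p_h = p_holiday if holiday == 1 else 1.0 - p_holiday
    return gaussian.pdf(temp) * p_h


# Hypothetical training data.
temps = [20.1, 19.8, 20.5, 21.0, 19.5, 20.2, 20.0, 20.4]
holidays = [0, 0, 1, 0, 0, 0, 1, 0]
gaussian, p_holiday = fit(temps, holidays)

epsilon = 1e-3  # hypothetical threshold; in practice it has to be tuned
typical = joint_density(20.2, 0, gaussian, p_holiday)
unusual = joint_density(35.0, 1, gaussian, p_holiday)

print(typical < epsilon, unusual < epsilon)  # False True
```

If the features are actually dependent, this product no longer equals the joint probability, and that is the inaccuracy you would be choosing to tolerate.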
I am also going to answer your other post here. First, I can’t tell you which data should be screened out just from the distribution, and I can’t justify calling any data point anomalous. I don’t know the scope of your project; I don’t know the meaning of that feature; I don’t know the meaning of that feature value; I don’t know about the data collection process. Second, we ask ourselves how we should model the data: should we use a Gaussian or not?
Cheers,
Raymond