Categorical variables in anomaly detection

Does the covariance method work when you have categorical variables? For instance, if I’m comparing values against whether the day is a holiday or not, do I need to transform the binary is_holiday feature so that it has a Gaussian distribution? Or do I need to look at different models at this point?

Hello @Maxim_Kupfer,

Gaussian is a handy distribution as I explained here, but as you said -

I think we should always look at this from the opposite direction. It’s not that we have to make everything Gaussian; it is our responsibility to make sure we are applying the correct statistical model assumption to the data in question. For example, we don’t ask how we can make the data Gaussian. Instead, we ask whether our data is Gaussian.

For your example of is_holiday, we ask whether it is Gaussian distributed. The answer is obviously no, because it is not a continuous variable, whereas the Gaussian is a continuous distribution. In this case, we think about which statistical model assumption best describes the data at hand. I would consider a binomial distribution (with a single trial, which is also known as the Bernoulli distribution).

Note that this course won’t cover all distributions so you will have to look up the right distribution yourself such as by googling or going to a library.

The anomaly detection algorithm in the course makes two assumptions: the data is Gaussian distributed, and the features are independent of each other. So it’s our job to verify that both hold. The first assumption tells us the form of the statistical model, but it doesn’t have to be Gaussian. The second assumption tells us that we can multiply together all the features’ probability density functions; otherwise we can’t. The first assumption is relatively easy to deal with: if the data is not Gaussian, we only need to find the distribution that fits best. The second is more difficult, because we need to think about how to handle dependent features, unless we tolerate the inaccuracy brought by the wrong assumption. Sometimes we need to accept some inaccuracy, because assumptions can’t always be perfectly met. It’s a balance, and it’s a decision you need to make as the person who models the data.

I am also going to answer your other post here. First, I can’t tell you which data should be screened out just from the distribution, and I can’t justify calling any data point anomalous. I don’t know the scope of your project; I don’t know the meaning of that feature; I don’t know the meaning of that feature value; I don’t know about the data collection process. Second, we ask ourselves how we should model the data: should we use Gaussian or not?

Cheers,
Raymond

Thanks for the thoughtful reply, and I certainly have lots to learn on the data science side of things, although I do wish the course did more work or at least provide next steps on understanding anomaly detection as there is just SOOO much more to it.

I did do some searching, but most paths led to long rabbit holes of academic research. But one thing I still can’t get a straightforward answer to is how to combine continuous and categorical variables for anomaly detection. Do you have any guidance on where I can start looking or how to properly phrase my question?

Indeed… the course isn’t and actually can’t be covering all the details. If they cannot offer it now, just go for other sources =P. I think you might want to look for some basic statistics courses or books.

Here is what I can share with you right now, very quickly. It is entry-level statistics, so you can find a complete description and more examples of any part of the below in any statistics book:

  1. This formula is general. It says that the probability of observing both a and b is (the probability of observing a) times (the probability of observing b given that a is observed).
    P(A=a, B=b) = P(B=b | A=a) \times P(A=a)

  2. Let’s say A and B are the features. Each can take a single value x (e.g. A=x) or a range of values (e.g. A >= x).

  3. Bringing in the assumption of independent features gives P(B | A) = P(B). In other words, we have
    P(A, B) = P(B) \times P(A)

  4. Bringing in the assumptions that B is a continuous variable that follows a Gaussian distribution and A is a binary variable that follows a binomial distribution, we ask for the probability of B being not less than a threshold (e.g. temperature >= 40) and A taking the value True (e.g. is_holiday = yes). Then we have
    P(A=\text{yes}, B \ge 40)
    = P(A=\text{yes}) \times P(B \ge 40)
    = p \times \displaystyle \int_{40}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right) \,dx

  5. Here p, \mu, and \sigma are parameters learnt from data. p is the chance of is_holiday being yes. Note that \text{yes} and 40 are the feature values of the new sample being tested for anomaly.

  6. We have the threshold \tau, and we can say that if P(A=\text{yes}, B \ge 40) < \tau then the new sample being tested is anomalous.
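
To make steps 4 to 6 concrete, here is a minimal Python sketch. The training data, the variable names (temperature, is_holiday), the helper joint_tail_probability, and the value of tau are all made up for illustration; only scipy.stats.norm.sf is a real library call, and it returns the tail probability P(B > x) of a Gaussian, which for a continuous variable equals P(B >= x).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical training data (an assumption for illustration only):
# a continuous feature B (temperature) and a binary feature A (is_holiday).
rng = np.random.default_rng(0)
temperature = rng.normal(loc=25.0, scale=5.0, size=1000)
is_holiday = rng.random(1000) < 0.1

# Step 5: learn the parameters p, mu and sigma from the data.
p = is_holiday.mean()        # chance of is_holiday being yes
mu = temperature.mean()      # Gaussian mean of temperature
sigma = temperature.std()    # Gaussian standard deviation of temperature


def joint_tail_probability(holiday, temp):
    """P(A = holiday, B >= temp), assuming A and B are independent."""
    p_a = p if holiday else 1.0 - p
    # norm.sf is the survival function, i.e. the integral in step 4.
    p_b = norm.sf(temp, loc=mu, scale=sigma)
    return p_a * p_b


# Step 6: compare with a threshold tau (hypothetical value).
tau = 1e-3
prob = joint_tail_probability(True, 40.0)
is_anomalous = prob < tau
print(prob, is_anomalous)
```

In a real project p, mu, sigma and tau would come from your own data and validation set, and you would still need to check that the independence and Gaussian assumptions are reasonable.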

Please note the following few points:

  1. We see how the formula changes from the general form to a more useful form under assumptions. If we don’t make those assumptions, we don’t arrive at the equation in step 4.

  2. Binomial is for a binary categorical variable, and a possible choice for a multi-class categorical variable is the multinomial distribution. But again, it’s always your job to verify the choice of distribution. I am not saying that we should use this or use that, or that we could always do this or do that. This still has to be a job for a human, not a robot :wink:

  3. The integration sign in step 4 may look unfamiliar to you because it never shows up in the lecture. We need to know that \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right) alone (without the integral sign) is a probability density function, whereas p is a probability. Just as we can’t compare an apple to an orange, we can’t multiply a probability density function by a probability and then say the outcome is a probability value. The integration is there to turn the density into a probability, so that the multiplication result becomes a probability value that can be compared with a probabilistic threshold \tau.

  4. Do we need to learn integral calculus to implement it in Python? No, scipy.stats.norm.sf (reference here) will do the job, but you need to experiment with it to find out how to use it.

  5. Again, step 4 is the correct way to calculate probabilities, using the integration sign. However, don’t be surprised if you see someone take the integration sign away and just plug in x=40. Unsurprisingly, this won’t calculate a probability, and consequently our \tau won’t have the meaning of probability either. However, if you don’t care about the meaning of being a probability value, we can still always find some value of \tau (without the meaning of a probability) to compare with that outcome and conclude whether the sample in question is anomalous or not.

  6. Some good things about having the integration sign, so that the outcome has the meaning of a probability, are that (1) we can interpret it as a probability that a human or your CEO can understand, and (2) we can further manipulate it like a probability (using our knowledge of probability mathematics) if our project requires us to, so the result is extensible. A good thing about not having the integration sign is that it calculates faster.
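
The difference between points 3 and 5 can be seen in a few lines of Python. This is only a sketch: the parameter values below are hypothetical, not learnt from any real data. scipy.stats.norm.pdf evaluates the density at a single point (no integral), and scipy.stats.norm.sf evaluates the tail probability (with the integral).

```python
from scipy.stats import norm

# Hypothetical parameters, chosen only for illustration.
p = 0.1                  # chance of is_holiday being yes
mu, sigma = 25.0, 5.0    # Gaussian parameters of temperature

# Without the integral: plug in x = 40 and get a density-based score.
density_score = p * norm.pdf(40.0, loc=mu, scale=sigma)

# With the integral: a genuine probability P(A = yes, B >= 40).
probability = p * norm.sf(40.0, loc=mu, scale=sigma)

# A density is not a probability: for a narrow Gaussian it can exceed 1.
narrow_density = norm.pdf(0.0, loc=0.0, scale=0.1)

print(density_score, probability, narrow_density)
```

Both numbers can be compared with some threshold, but only the second one can be read as a probability, so the two need different values of \tau.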

Cheers,
Raymond

PS: Grab a statistics book :wink:

Wow, super helpful stuff @rmwkwok! Thank you!
