At work I have faced an issue with telecommunications radio KPIs. I was asked to run a regression on KPI values but, in the dataset I was given, there were some records with an unexpected pattern: they looked like they were coming from faulty radio equipment.
Specifically, I need to work with packet loss and channel load. Higher channel load is expected to lead to higher packet loss, but I sometimes see low or medium channel load paired with high packet loss, and this is due to hardware errors.
The problem is that both packet loss and channel load are percentages between 0 and 100%, so there are no outrageous values such as -10% or 120%.
It also needs to be taken into account that values around 10% or 90% are not outliers on their own; every value between 0% and 100% is an expected value.
The only anomalous records are those coming from the combination of low-to-medium channel load and high packet loss.
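To make the pattern concrete, here is a naive rule-based flag in Python; the 50% and 60% thresholds are only illustrative guesses, not values taken from my real data:

```python
import pandas as pd

# Toy example: flag records where channel load is low-to-medium
# but packet loss is high. The 50% / 60% thresholds are only
# illustrative guesses, not values derived from the real data.
df = pd.DataFrame({
    "channel_load": [10, 45, 80, 95, 30],   # percent
    "packet_loss":  [ 2, 70,  5, 60, 85],   # percent
})

suspect = (df["channel_load"] < 50) & (df["packet_loss"] > 60)
print(df[suspect])  # candidate faulty-equipment records
```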
What task were you asked to do?
Are you trying to create a model of the typical performance?
Or are you trying to identify the faulty equipment so it can be repaired?
This course doesn’t discuss “data cleaning” at all. It’s a difficult topic all its own.
One approach you might take is to use the anomaly detection method that is discussed in MLS Course 3 to identify the examples you want to remove.
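As a minimal sketch of that style of density-based detection, assuming the two features are channel load and packet loss: estimate a per-feature Gaussian from the data and flag low-density points. The data and the epsilon value below are placeholders you would tune on your own set.

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and variance of the training data."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    return mu, var

def gaussian_prob(X, mu, var):
    """Product of independent per-feature Gaussian densities."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    exponent = -((X - mu) ** 2) / (2.0 * var)
    return np.prod(coef * np.exp(exponent), axis=1)

# X: rows of [channel_load, packet_loss] in percent (placeholder data)
X = np.array([[80.0, 70.0], [85.0, 75.0], [20.0, 90.0], [75.0, 65.0]])
mu, var = estimate_gaussian(X)
p = gaussian_prob(X, mu, var)
epsilon = 1e-3  # threshold; in practice tuned, e.g. on a labeled validation set
print(X[p < epsilon])  # candidate anomalies to inspect before removing
```

Note that a plain density model will also flag rare but valid extremes, which relates to the caution below.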
But you have to be careful that you don’t remove examples which are important.
If you create a model that is trained only on a cleaned data set, that model might not predict well on new data, which may still include anomalies.
I would probably start with a visualization, followed by a statistical analysis. The question would be whether it is possible to describe, e.g. with two features (or more?), the characteristic anomaly pattern you described as a combination of (see the plotting sketch after this list):
low-medium channel load
and high packet loss.
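A quick scatter plot is usually enough for that first look. This sketch uses synthetic placeholder data in place of your real columns:

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: replace with the real channel_load / packet_loss columns.
rng = np.random.default_rng(0)
channel_load = rng.uniform(0, 100, 500)
packet_loss = np.clip(channel_load * 0.8 + rng.normal(0, 10, 500), 0, 100)

plt.scatter(channel_load, packet_loss, s=10, alpha=0.5)
plt.xlabel("Channel load (%)")
plt.ylabel("Packet loss (%)")
plt.title("Packet loss vs. channel load")
plt.show()
```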
If so, you could fit e.g. a Gaussian mixture model in the next step and evaluate its capabilities within your feature space with relevant metrics and a residual analysis.
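A minimal sketch with scikit-learn's GaussianMixture, again on synthetic placeholder data; the number of components and the 1% density cutoff are assumptions you would tune:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X: (n_samples, 2) array of [channel_load, packet_loss]; synthetic here.
rng = np.random.default_rng(1)
load = rng.uniform(0, 100, 1000)
loss = np.clip(load * 0.8 + rng.normal(0, 10, 1000), 0, 100)
X = np.column_stack([load, loss])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gmm.score_samples(X)          # per-sample log-likelihood
threshold = np.percentile(log_density, 1)   # flag the 1% least likely points
anomalies = X[log_density < threshold]
```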
I would expect that after a first visualization you could judge whether your features are sufficient to solve your business problem in an acceptable way. If not, following the CRISP-DM methodology in an iterative way and, e.g., enhancing your features might be a good option.
This is also a possibility! I understand that in this case you only want to use "normal data" to train your anomaly detection model in an unsupervised way. This approach was described here:
For example, a popular approach is to learn your normal behaviour as a "normal cluster" and, if a certain data point is too far away from this cluster, to conclude it is an anomaly.
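A minimal sketch of that idea, assuming you can isolate a set of healthy records first; the single-centroid model and the 99th-percentile cutoff are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# X_normal: records believed to be healthy; X_new: points to score. Synthetic here.
rng = np.random.default_rng(2)
X_normal = rng.normal(loc=[70, 60], scale=5, size=(500, 2))
X_new = np.array([[72.0, 61.0], [20.0, 95.0]])

# Model the "normal cluster" with a single centroid fit on normal data only.
km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X_normal)
dist = np.linalg.norm(X_new - km.cluster_centers_[0], axis=1)

# Distance cutoff derived from the normal data itself (99th percentile).
cutoff = np.percentile(
    np.linalg.norm(X_normal - km.cluster_centers_[0], axis=1), 99
)
print(X_new[dist > cutoff])  # flagged as too far from normal behaviour
```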
Autoencoders, for example, are a popular choice for anomaly detection if you have a sufficient amount of normal data and the problem is suited to it.
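A rough sketch with Keras, where the layer sizes, epoch count, and synthetic data are all placeholder assumptions: train the autoencoder on normal data only, then score new points by reconstruction error.

```python
import numpy as np
import tensorflow as tf

# Train only on data assumed to be normal; a high reconstruction error on
# a new point then suggests an anomaly. Data here is synthetic.
rng = np.random.default_rng(3)
X_normal = rng.normal(loc=[0.7, 0.6], scale=0.05, size=(1000, 2)).astype("float32")

autoencoder = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="relu"),   # bottleneck
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(2),                      # reconstruct the inputs
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_normal, X_normal, epochs=20, batch_size=32, verbose=0)

X_test = np.array([[0.71, 0.61], [0.2, 0.95]], dtype="float32")
recon = autoencoder.predict(X_test, verbose=0)
error = np.mean((X_test - recon) ** 2, axis=1)
print(error)  # larger error => more anomalous
```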