Splitting train and cv/test set for anomaly detection

Hi professors, tutors, and classmates, I just finished the part on “anomaly detection”. I am a bit confused about how to split the train and cv/test sets. My current understanding is that the sets should be split as described below. Please let me know whether my understanding is correct, or show me the correct approach. Thanks

Suppose we have some unbalanced data, where 0 = normal and 1 = anomaly, and we already know some of the anomalous examples.

  • For the training set, we train on the features only, without labels. The original labels in the training set can be either 0 or 1; that does not matter. (Am I right?)
  • For the cv/test set, we train on the data with both features and labels, including both the 0 and 1 labels.

Preferably with only the normal data. However, if the anomalous examples make up an insignificant portion of the training set, that should not be too harmful.

We don’t train anything with the cv/test dataset, but yes, we need labels (and features) to evaluate our trained model.
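
In case it helps, here is a minimal sketch of that split on synthetic data. It assumes a simple per-feature Gaussian density model and a threshold picked by F1 on the cv set; the function names (`split_for_anomaly_detection`, `fit_gaussian`, `log_density`, `select_threshold`), the 60/20/20 proportions, and the generated data are all hypothetical and only for illustration.

```python
import numpy as np

def split_for_anomaly_detection(X, y, rng, normal_train_frac=0.6):
    """Train set: normal examples only (labels not needed).
    CV and test sets: the remaining normal examples plus the known
    anomalies, each returned together with its labels."""
    normal_idx = rng.permutation(np.flatnonzero(y == 0))
    anomaly_idx = rng.permutation(np.flatnonzero(y == 1))
    n_train = int(normal_train_frac * len(normal_idx))
    n_cv_normal = (len(normal_idx) - n_train) // 2
    n_cv_anomaly = len(anomaly_idx) // 2
    train_idx = normal_idx[:n_train]
    cv_idx = np.concatenate([normal_idx[n_train:n_train + n_cv_normal],
                             anomaly_idx[:n_cv_anomaly]])
    test_idx = np.concatenate([normal_idx[n_train + n_cv_normal:],
                               anomaly_idx[n_cv_anomaly:]])
    return X[train_idx], (X[cv_idx], y[cv_idx]), (X[test_idx], y[test_idx])

def fit_gaussian(X):
    """Estimate a per-feature mean and variance from the training features."""
    return X.mean(axis=0), X.var(axis=0)

def log_density(X, mu, var):
    """Log of the product of independent per-feature Gaussian densities."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

def select_threshold(scores, y_true, n_steps=1000):
    """Scan thresholds and keep the one with the best F1 on labeled data.
    An example is flagged as an anomaly when its score is below the threshold."""
    best_eps, best_f1 = None, -1.0
    for eps in np.linspace(scores.min(), scores.max(), n_steps):
        pred = scores < eps
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp == 0:
            continue
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Synthetic, made-up data: 1000 normal points and 20 known anomalies.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(1000, 2)),
               rng.normal(6.0, 1.0, size=(20, 2))])
y = np.concatenate([np.zeros(1000, dtype=int), np.ones(20, dtype=int)])

X_train, (X_cv, y_cv), (X_test, y_test) = split_for_anomaly_detection(X, y, rng)

# Fit on training features only; no labels are used here.
mu, var = fit_gaussian(X_train)

# Labels are used only for evaluation: choosing the threshold on cv,
# then checking how it does on the held-out test set.
best_eps, best_f1 = select_threshold(log_density(X_cv, mu, var), y_cv)
test_pred = log_density(X_test, mu, var) < best_eps
print("cv F1:", best_f1)
print("flagged in test:", int(test_pred.sum()), "of", int(y_test.sum()), "true anomalies")
```

The point of the layout is that the model is fit on `X_train` alone, while `y_cv` and `y_test` appear only when scoring predictions.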


Got it. Thanks so much!!! That made things much clearer for me!