Splitting train and cv/test set for anomaly detection

kaian0414 · June 29, 2023, 1:38pm

Hi professors, tutors, and classmates, I just finished the part on “anomaly detection”. I am a bit confused about how to split the train and cv/test sets. My current understanding is that the sets should be split as described below. Please let me know whether my understanding is correct, or show me the correct way. Thanks

Suppose we have some imbalanced data (0 = normal, 1 = anomaly), and some of the anomalies are already known.

For the training set, we train on the features only, with no labels. The original labels of the training examples can be 0 or 1; that doesn't matter. (Am I right?)
For the cv/test set, we train on the data with both features and labels, including both the 0 and 1 labels.

rmwkwok · June 30, 2023, 9:23am

For the training set, preferably use only the normal data. However, if the anomalous examples make up an insignificant portion of the training set, that should not be too harmful.
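One way to picture such a split is the small NumPy sketch below. It is only an illustration under my own assumptions: the helper name `split_for_anomaly_detection`, the 60/20/20 ratio for normal examples, the half-and-half division of known anomalies between cv and test, and the synthetic data are all hypothetical choices, not something stated in this thread.

```python
import numpy as np

def split_for_anomaly_detection(X, y, seed=0):
    """Hypothetical helper: normal rows go 60/20/20 into train/cv/test,
    known anomalies are divided evenly between cv and test only."""
    rng = np.random.default_rng(seed)
    normal_idx = rng.permutation(np.flatnonzero(y == 0))
    anomaly_idx = rng.permutation(np.flatnonzero(y == 1))

    n_train = len(normal_idx) * 6 // 10   # assumed 60% of normal rows
    n_cv = len(normal_idx) * 2 // 10      # assumed 20% of normal rows
    a_cv = len(anomaly_idx) // 2          # half of the anomalies go to cv

    train_idx = normal_idx[:n_train]
    cv_idx = np.concatenate([normal_idx[n_train:n_train + n_cv],
                             anomaly_idx[:a_cv]])
    test_idx = np.concatenate([normal_idx[n_train + n_cv:],
                               anomaly_idx[a_cv:]])

    # The training set is returned without labels: only features are used to fit.
    return X[train_idx], (X[cv_idx], y[cv_idx]), (X[test_idx], y[test_idx])

# Synthetic, made-up data: 1000 normal points and 20 known anomalies.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0, 1, size=(1000, 2)),
               data_rng.normal(6, 1, size=(20, 2))])
y = np.concatenate([np.zeros(1000, dtype=int), np.ones(20, dtype=int)])

X_train, (X_cv, y_cv), (X_test, y_test) = split_for_anomaly_detection(X, y)
print(X_train.shape, X_cv.shape, X_test.shape)  # (600, 2) (210, 2) (210, 2)
print(int(y_cv.sum()), int(y_test.sum()))       # 10 10
```

If a few unlabeled anomalies slip into the training rows, the sketch still runs unchanged; they just shift the fitted distribution slightly.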

We don’t train anything with the cv/test dataset, but yes, we need labels (and features) to evaluate our trained model.
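To make the two roles concrete, here is a rough sketch assuming a simple per-feature Gaussian density model with a threshold chosen by F1 score on the cv set. The function names, the synthetic data, and the 1000-step threshold search are my own assumptions for illustration; other models and metrics would work as well.

```python
import numpy as np

def fit_gaussian(X_train):
    # Fitting uses features only; no labels are involved.
    return X_train.mean(axis=0), X_train.var(axis=0)

def density(X, mu, var):
    # Product of independent univariate Gaussian densities, one per feature.
    coef = 1.0 / np.sqrt(2 * np.pi * var)
    return np.prod(coef * np.exp(-((X - mu) ** 2) / (2 * var)), axis=1)

def f1_score(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_threshold(y_cv, p_cv, steps=1000):
    # The cv labels are used only to evaluate candidate thresholds.
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), steps):
        f1 = f1_score(y_cv, (p_cv < eps).astype(int))
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Synthetic, made-up data: normal-only training set, labeled cv set.
rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(600, 2))
X_cv = np.vstack([rng.normal(0, 1, size=(200, 2)),
                  rng.normal(6, 1, size=(10, 2))])
y_cv = np.concatenate([np.zeros(200, dtype=int), np.ones(10, dtype=int)])

mu, var = fit_gaussian(X_train)                  # training: features only
p_cv = density(X_cv, mu, var)
best_eps, best_f1 = select_threshold(y_cv, p_cv)  # evaluation: needs labels
print(f"epsilon={best_eps:.3e}, F1 on cv={best_f1:.3f}")
```

The same `density` and chosen threshold would then be applied once to the labeled test set to report a final score.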

Cheers,
Raymond

kaian0414 · July 3, 2023, 12:39pm

Got it. Thanks so much!!! You made things much clearer for me!
