When you go through ‘C3_W1_Anomaly_Detection’ you notice that the mean and variance are calculated solely from X_train. Those values are then applied to X_val to find the best epsilon value. Should the mean and variance not be calculated solely from X_val instead?
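For reference, here is a minimal sketch of that workflow, assuming NumPy; names like estimate_gaussian, gaussian_prob, and select_threshold are illustrative stand-ins, not necessarily the assignment's exact code:

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and variance, fit on the training set only."""
    mu = np.mean(X, axis=0)
    var = np.var(X, axis=0)
    return mu, var

def gaussian_prob(X, mu, var):
    """Density under an independent (diagonal-covariance) Gaussian model."""
    p = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.prod(p, axis=1)

def select_threshold(y_val, p_val):
    """Scan candidate epsilons; keep the one with the best F1 on the val set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), 1000):
        preds = p_val < eps                 # flag low-density points as anomalies
        tp = np.sum(preds & (y_val == 1))
        fp = np.sum(preds & (y_val == 0))
        fn = np.sum(~preds & (y_val == 1))
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps

# mu and var come from X_train only; the labelled X_val only picks epsilon:
# mu, var = estimate_gaussian(X_train)
# epsilon = select_threshold(y_val, gaussian_prob(X_val, mu, var))
```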
You have no way of knowing how many anomalies are in X_train. What if there are many? Would many anomalies not bias the mean and variance estimates towards corruption? The fitted Gaussian (and, in the multivariate case, its covariance and inverse covariance) would be distorted in shape (more bulbous). Applying those values to X_val would then give an incorrect epsilon value. And when you apply that incorrect epsilon, chosen on X_val, back to X_train, you may be creating false negatives without ever knowing it.
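To make that worry concrete, here is a small hypothetical demonstration (the 5% contamination rate and the spreads are invented) of how anomalies inflate a variance estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=10_000)     # nominal points, var ~ 1.0
outliers = rng.normal(0.0, 10.0, size=500)    # 5% contamination
contaminated = np.concatenate([clean, outliers])

print(clean.var())         # close to 1.0
print(contaminated.var())  # several times larger: the fitted Gaussian widens,
                           # so fewer points fall below any given epsilon
```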
Since you know the true labels of X_val, would you not use that set to calculate the mean and variance, and therefore epsilon? You could then apply that epsilon to the X_train set and be more confident in the outcomes.
There is obviously a reason why the mean and variance are not taken solely from X_val; I just do not know what it is.
This may or may not be a problem.
It is not a problem if your new data can still be identified correctly by your current model.
Otherwise, it is a data-shift problem: the data we collected for training no longer represents the same distribution as the data we need to make predictions for.
In this case, the simplest step would be to train a new model with a representative set of data.
You may consider your current X_val as part of your new X_train. In other words,
X_{train}^{new} = X_{val}^{current} + more
Then you need a new set of validation data X_{val}^{new} to validate your model - after all, we can't reuse X_{train}^{new} to validate a model built with X_{train}^{new} itself.
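As a rough sketch of that re-splitting step (the arrays X_val_current, y_val_current, X_more, and y_more are hypothetical placeholders for your old validation set and the newly collected labelled data):

```python
import numpy as np

# Pool the old validation set together with the newly collected data.
X_pool = np.concatenate([X_val_current, X_more], axis=0)
y_pool = np.concatenate([y_val_current, y_more], axis=0)

# Shuffle, then carve out a fresh X_val_new that the new model
# never sees while its mean and variance are being fit.
idx = np.random.default_rng(0).permutation(len(X_pool))
split = int(0.8 * len(X_pool))
X_train_new, y_train_new = X_pool[idx[:split]], y_pool[idx[:split]]
X_val_new, y_val_new = X_pool[idx[split:]], y_pool[idx[split:]]
```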