When you go through ‘C3_W1_Anomaly_Detection’ you notice that the mean and variance are calculated solely from X_train. Those values are then applied to X_val to find the best epsilon value. Should the mean and variance not be calculated solely from X_val instead?
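For reference, here is a minimal sketch of that workflow, assuming NumPy; names like estimate_gaussian, gaussian_prob, and select_threshold are illustrative stand-ins, not necessarily the assignment's exact code:

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and variance, fit on the training set only."""
    mu = np.mean(X, axis=0)
    var = np.var(X, axis=0)
    return mu, var

def gaussian_prob(X, mu, var):
    """Density under an independent (diagonal-covariance) Gaussian model."""
    p = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.prod(p, axis=1)

def select_threshold(y_val, p_val):
    """Scan candidate epsilons; keep the one with the best F1 on the val set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), 1000):
        preds = p_val < eps                 # flag low-density points as anomalies
        tp = np.sum(preds & (y_val == 1))
        fp = np.sum(preds & (y_val == 0))
        fn = np.sum(~preds & (y_val == 1))
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps

# mu and var come from X_train only; the labelled X_val only picks epsilon:
# mu, var = estimate_gaussian(X_train)
# epsilon = select_threshold(y_val, gaussian_prob(X_val, mu, var))
```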
You have no way of knowing how many anomalies are in X_train. What if there are many? Would many anomalies not bias the mean and variance estimates towards corruption? The fitted Gaussian (and, in the multivariate case, its covariance and inverse covariance) would be distorted in shape (more bulbous). Applying those values to X_val would then give an incorrect epsilon value. And when you apply that incorrect epsilon, chosen on X_val, back to X_train, you may be creating false negatives without ever knowing it.
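To make that worry concrete, here is a small hypothetical demonstration (the 5% contamination rate and the spreads are invented) of how anomalies inflate a variance estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=10_000)     # nominal points, var ~ 1.0
outliers = rng.normal(0.0, 10.0, size=500)    # 5% contamination
contaminated = np.concatenate([clean, outliers])

print(clean.var())         # close to 1.0
print(contaminated.var())  # several times larger: the fitted Gaussian widens,
                           # so fewer points fall below any given epsilon
```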
Since you know the true labels of X_val, would you not use that set to calculate the mean and variance, and therefore epsilon? You could then apply that epsilon to the X_train set and be more confident in the outcomes.
There is obviously a reason why the mean and variance are not taken solely from X_val; I just do not know what it is.
This may or may not be a problem.
It is not a problem if your new data can still be identified correctly by your current model.
Otherwise, it is a data-shift problem: the data we collected for training no longer represents the same distribution as the data we need to make predictions for.
In this case, the simplest step would be to train a new model with a representative set of data.
You may consider your current X_val as part of your new X_train. In other words,
X_{train}^{new} = X_{val}^{current} + more
Then you need a new set of validation data X_{val}^{new} to validate your model - after all, we can't reuse X_{train}^{new} to validate a model built with X_{train}^{new} itself.
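As a rough sketch of that re-splitting step (the arrays X_val_current, y_val_current, X_more, and y_more are hypothetical placeholders for your old validation set and the newly collected labelled data):

```python
import numpy as np

# Pool the old validation set together with the newly collected data.
X_pool = np.concatenate([X_val_current, X_more], axis=0)
y_pool = np.concatenate([y_val_current, y_more], axis=0)

# Shuffle, then carve out a fresh X_val_new that the new model
# never sees while its mean and variance are being fit.
idx = np.random.default_rng(0).permutation(len(X_pool))
split = int(0.8 * len(X_pool))
X_train_new, y_train_new = X_pool[idx[:split]], y_pool[idx[:split]]
X_val_new, y_val_new = X_pool[idx[split:]], y_pool[idx[split:]]
```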