As you know, any model is trained on the features it is given, and its behaviour depends on those selected features/classes/dimensions.
So if a feature is scaled differently in different datasets, the model will not learn anything that carries over to the other datasets, and accuracy or loss will suffer: each dataset would have its own scaling, so the model ends up fitting the particular scaling of the training data rather than doing the actual task of learning features that hold for all of the data, irrespective of whether it is training, test or validation.
Imagine the features in the training set are scaled to values around 500, but in the test set the same features come out around 300; then the test set is not really comparable with the 500-scaled features the model was trained on.
Actually, you can try this yourself: use differently scaled features and see the result; depending on the complexity of the model it might overfit or underfit.
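Here is a minimal sketch of what I mean by "using the same scaling" (the arrays and numbers are made up, just for illustration, and I am assuming plain numpy): compute the mean and standard deviation from the training set only, then reuse those exact values for the test set and for any new input.

```python
import numpy as np

# Hypothetical 1-D feature values, purely for illustration
x_train = np.array([300.0, 400.0, 450.0, 500.0, 600.0])
x_test = np.array([350.0, 500.0, 550.0])

mu = x_train.mean()      # mean computed from the training set only
sigma = x_train.std()    # standard deviation computed from the training set only

x_train_scaled = (x_train - mu) / sigma
x_test_scaled = (x_test - mu) / sigma   # same mu and sigma, not test-set statistics

print(x_train_scaled)
print(x_test_scaled)
```

If you instead recomputed mu and sigma on the test set, the same raw value would land in a different place after scaling, which is exactly the mismatch described above.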
Sorry, the wording of that explanation in the quote box above is confusing to me.
Let me further clarify my confusion about the dot points in the screenshot I shared in my initial post. They use the example of an input feature x = 500 corresponding to y = 300. I am confused about:
Is this input feature in the training, cross-validation or test dataset?
When the model is deployed and the user feeds it an observation equal to x = 500, how would using the standard deviation and mean of the training set, versus a separate standard deviation and mean of the cross-validation/test set, make any difference to the prediction?
For example, if x = 500 was in the training set and is scaled down to 0.5, it still maps to y=300. If x=500 was in the cross-validation/test set and is scaled down to something else like 0.3, it would still map to y = 300.
You missed the line where it mentions that the user is using the designated model that was already trained.
When using any deployed model, it is very important to make sure that any data fed to the trained model has the same distribution as the data the model was trained on. Scaling features differently can cause incorrect predictions, because the features the model was trained on will not match the differently scaled features the user supplies.
Z-score is one method of normalising data: each feature x is transformed to z = (x - mean) / standard deviation, so that the feature ends up with a mean of 0 and a standard deviation of 1.
The statement in the image actually means that if the user's sample x = 500 is not scaled the same way as it was for the deployed model, the scaled value will not be 0.5, so the output y will not come out as 300.
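To make those numbers concrete, here is a small sketch. The statistics are made up (the screenshot doesn't give them): I am assuming a training mean of 450 and standard deviation of 100, so that z = (500 - 450) / 100 = 0.5, and a different mean of 470 for the user's own data to show how the same raw input then lands somewhere else.

```python
x_new = 500.0

# Assumed statistics, purely for illustration
mu_train, sigma_train = 450.0, 100.0   # statistics of the training set
mu_user, sigma_user = 470.0, 100.0     # statistics the user (wrongly) computes from their own data

z_correct = (x_new - mu_train) / sigma_train   # 0.5 -> the value the model saw during training
z_wrong = (x_new - mu_user) / sigma_user       # 0.3 -> a different point in feature space

print(z_correct, z_wrong)
```

The model learned to map z = 0.5 to y = 300; feed it z = 0.3 instead and it will predict something else, even though the raw input was the same 500.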
I hope you know that any dataset, whether it is used for training or testing, needs to have the same data distribution to get the right predictions, because the mean and standard deviation would change for a different distribution of data.
Ah ok, thanks. I wasn't aware that when using a model, the distribution of any dataset entered into it needs to match the distribution of the training set.
So is it the case that when you normalise a dataset by the mean and standard deviation of the training set, the distribution will match the training set?
Would you also be able to explain why the distribution of any dataset entered into a model needs to match the distribution of the training set? Or if you have any helpful resources you could link to.