I have a question about how scaled features are served in applications/production environments.
Let's say I have a feature “F”. As part of my model training process, assume I have scaled feature F using z-score normalization. I recall from the class that feature scaling can speed up training because it reduces the number of iterations gradient descent needs to find a local/global minimum. Let's say the “best” model was found and saved.
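For concreteness, here is a minimal sketch of the z-score scaling I mean (assuming NumPy; the values in f_train are just made-up training values for F):

```python
import numpy as np

# Hypothetical training values for feature F
f_train = np.array([12.0, 15.5, 9.8, 20.1, 13.3])

# z-score parameters are computed from the training data only
mu = f_train.mean()
sigma = f_train.std()

# Scaled version of F that the model is trained on
f_train_scaled = (f_train - mu) / sigma
```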
What would the next steps typically look like?
(1) Serve feature F in some feature store, where F is maintained as the scaled version
(2) Retrain the model on the unscaled version of feature F, and serve F in the feature store in its original form
For (1), are the mean and standard deviation computed on the fly every time? What is the best practice here? Or do we skip (1) and do (2) instead? Thanks for reading!
I guess my main concern is that (in my current organization) there is a strict service level agreement whereby my model has to return a response within some duration X.
When a system calls my model, all the features are generated via a feature store. I am unsure whether computing normalized features on every single call will violate the SLA.
Feature scaling does not explicitly improve model performance, if I recall correctly. If this is the case, I guess it would be a good tradeoff to train a model on unnormalized features (even if it is redundant) just so my feature store computes faster?
Curious to know if you'd come to the same conclusion as me.
In general, the main reason to re-train a model is that you are adding new training examples.
So: save the model (the weights and biases) so you can use it to make new predictions. You'll also need to save the normalization parameters (the mean and sigma, or whatever normalization you used in training), so that you can apply the same normalization to any new inputs you want to make predictions on.
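A minimal sketch of that workflow, assuming scikit-learn and joblib (the file name and model choice are just placeholders):

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# --- training time ---
X_train = np.array([[12.0], [15.5], [9.8], [20.1], [13.3]])  # hypothetical raw values of F
y_train = np.array([1.0, 2.0, 0.5, 3.1, 1.4])

scaler = StandardScaler().fit(X_train)                        # learns the mean and sigma of F
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# Persist both artifacts together: the weights AND the normalization parameters
joblib.dump({"scaler": scaler, "model": model}, "model_bundle.joblib")

# --- prediction time ---
bundle = joblib.load("model_bundle.joblib")
x_new = np.array([[14.2]])                                    # raw (un-scaled) feature from the feature store
x_new_scaled = bundle["scaler"].transform(x_new)              # apply the SAME training-time normalization
prediction = bundle["model"].predict(x_new_scaled)
```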
If you want to re-train, you might batch up the new examples for a while, and then re-train the model periodically.
Exactly what strategy you use depends on how big the data set is, how much computing power training takes, how often you get new training data, and how up-to-date you want the models to be for making new predictions.
Recommendation: Do not train your model on un-normalized data. The benefits of normalization are many, and the penalties for skipping it can be extreme: unstable training that is extremely difficult to debug.
Sometimes, without feature scaling, your model never converges to any minimum during training, so this is not just about speed anymore. With this possibility in mind, we should agree that (2) is not always possible.
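A toy illustration of that failure mode (the numbers and learning rate below are made up): with the same learning rate, gradient descent diverges on the raw feature and converges once F is z-score scaled.

```python
import numpy as np

def gradient_descent(x, y, lr, iters):
    """Plain batch gradient descent on MSE for the model y_hat = w*x + b."""
    w, b = 0.0, 0.0
    for _ in range(iters):
        err = w * x + b - y
        w -= lr * 2 * np.mean(err * x)
        b -= lr * 2 * np.mean(err)
    return np.mean((w * x + b - y) ** 2)  # final training loss

# Hypothetical raw feature F on a large scale, with a simple linear target
x_raw = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y = 0.005 * x_raw + 1.0

# Same learning rate: on the raw feature the loss explodes ...
print("unscaled loss:", gradient_descent(x_raw, y, lr=0.1, iters=20))

# ... while on the z-score-scaled feature it converges to (near) zero loss
x_scaled = (x_raw - x_raw.mean()) / x_raw.std()
print("scaled loss:  ", gradient_descent(x_scaled, y, lr=0.1, iters=200))
```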
Personally, I would keep F in its original state, because F can serve more than one model and/or more than one version of a model. If these models are supplied with different subsets of F, then obviously they require F to be scaled differently (different means and standard deviations). In that case, what is the easiest way of maintaining them without having one copy of F per model?
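For illustration, one possible arrangement (the dictionaries and names below are hypothetical): the feature store keeps a single raw copy of F, and each model artifact carries its own scaling parameters.

```python
# Hypothetical: the feature store holds only the raw value of F
FEATURE_STORE = {"F": 14.2}

# Each model (or model version) carries its own scaling parameters,
# learned from whatever subset of data it was trained on
MODEL_PARAMS = {
    "model_a_v1": {"mu": 13.0, "sigma": 3.1},
    "model_b_v2": {"mu": 12.4, "sigma": 2.7},
}

def get_scaled_f(model_name: str) -> float:
    """Scale the single raw copy of F with the requested model's parameters."""
    raw = FEATURE_STORE["F"]
    p = MODEL_PARAMS[model_name]
    return (raw - p["mu"]) / p["sigma"]

print(get_scaled_f("model_a_v1"))
print(get_scaled_f("model_b_v2"))
```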
Then I am afraid you just need to find out.
Z-score scaling takes one subtraction and one division. If your model is already large, those two operations are probably negligible.
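If you want to put a number on it, here is a rough sketch of how to time just the scaling step (the parameter values and vector size are made up):

```python
import timeit
import numpy as np

mu, sigma = 13.0, 3.1          # stored training-time parameters (hypothetical values)
raw = np.random.rand(50)       # hypothetical raw feature vector for one request

# Average time per call for the scaling step alone
per_call = timeit.timeit(lambda: (raw - mu) / sigma, number=100_000) / 100_000
print(f"z-score scaling: ~{per_call * 1e6:.2f} microseconds per request")
```

On most hardware this comes out in the low microseconds, which should be easy to compare against your SLA budget.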