I have a hypothetical question.
If we, let’s say, train a classifier network after modifying (centering/scaling/normalizing) some features of the labeled examples, and, then, ask the trained network to classify an unseen example:
Do we apply the same centering/scaling/normalizing that we applied to the specific features to the unseen example’s corresponding features? For instance, if we had used z-scores for a certain feature during training, do we use the same mean and standard deviation (as observed from the training examples) to calculate the z-score of that particular feature of the unseen example, and then feed it to the trained network for classification?
Please feel free to refer me to some reading as well.
Thank you in advance.
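Yes, the input at prediction time has to go through the very same modifications, feature engineering, etc. that you applied to the data used during training. In your z-score example, that means reusing the mean and standard deviation computed from the training examples, not recomputing them from the new data.
This means that whatever transformation you apply to the training datasets has to be carefully documented, including any parameters it learned from the data, so that it can be applied to new data when the model is making predictions.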
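To make the z-score case concrete, here is a minimal sketch using only the Python standard library. The feature values and variable names are made up for illustration; the point is only that the mean and standard deviation are computed once from the training data and then reused for the unseen example.

```python
from statistics import mean, pstdev

# Hypothetical training values for a single feature (e.g. age in years).
train_ages = [22.0, 35.0, 47.0, 51.0, 60.0]

# Compute the statistics once, from the training data only.
train_mean = mean(train_ages)
train_std = pstdev(train_ages)  # population standard deviation


def z_score(value, mu, sigma):
    """Standardize one value with previously computed statistics."""
    return (value - mu) / sigma


# The training inputs are standardized with the training statistics...
train_scaled = [z_score(a, train_mean, train_std) for a in train_ages]

# ...and so is an unseen example at prediction time.
unseen_age = 40.0
unseen_scaled = z_score(unseen_age, train_mean, train_std)
```

Here `unseen_scaled` comes out negative because 40 is below the training mean of 43, which is exactly the information the trained network expects to receive.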
Thank you, @Juan_Olano. I appreciate your response.
What are some ways to keep that kind of documentation, considering that the person who trained a network may not be there later on to use it? Are there any commonly used (standard) tools for that, or would one look at the training code to deduce that information when needed?
Documentation can be done in a simple tool like a word processor. There are also some more sophisticated platforms, like the one provided by landing.ai which specializes in computer vision development. There are others that also provide support in the management of data and of the models themselves.
The code should also always be properly documented, of course.
The people who launch and run the model are not necessarily the same people who developed it, so good documentation should be provided, and again this can be done in very simple tools like word processors.
I personally use word processing documents and Excel too.
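As a complement to written documentation, one option (my own suggestion, not a standard tool) is to store the preprocessing parameters in a small machine-readable file next to the model, so whoever runs it later does not have to dig through the training code. Below is a rough sketch with the standard library; the feature names, numbers, file name, and the `apply_preprocessing` helper are all hypothetical.

```python
import json
import os
import tempfile

# Hypothetical record of what was done to each feature during training.
preprocessing = {
    "age": {"method": "z-score", "mean": 43.0, "std": 13.4},
    "income": {"method": "min-max", "min": 12000.0, "max": 98000.0},
}

# Save the record alongside the model (a temp directory is used here).
path = os.path.join(tempfile.mkdtemp(), "preprocessing.json")
with open(path, "w") as f:
    json.dump(preprocessing, f, indent=2)

# Later, possibly on another team: load the record and apply it.
with open(path) as f:
    loaded = json.load(f)


def apply_preprocessing(example, params):
    """Transform one raw example using the stored training parameters."""
    out = {}
    for name, value in example.items():
        p = params[name]
        if p["method"] == "z-score":
            out[name] = (value - p["mean"]) / p["std"]
        elif p["method"] == "min-max":
            out[name] = (value - p["min"]) / (p["max"] - p["min"])
        else:
            raise ValueError("unknown method: " + p["method"])
    return out


scaled = apply_preprocessing({"age": 40.0, "income": 55000.0}, loaded)
```

A file like this is readable by humans as well, so it can sit in the same folder as the word processing or Excel documentation.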
To add to this topic, some common preprocessing steps applied to data are:
Normalization or standardization of the data. Normalization usually means rescaling a feature to a fixed range such as 0 to 1, while standardization (z-scores) rescales it to have a mean of 0 and a standard deviation of 1. Either can help the model converge faster and can improve its performance.
Imputing missing values in the data. This is often necessary because many machine learning algorithms cannot handle missing values.
Removing outliers from the data. Outliers can have a negative impact on the performance of some machine learning algorithms.
Encoding categorical variables as numeric values. Many machine learning algorithms require that the input data be numeric, so categorical variables need to be encoded as numbers.
Feature selection, which involves selecting a subset of the available features to use in the model. This can help improve the performance of the model and can make it easier to interpret the results.
These are some of the most common data preprocessing steps. And again, anything you decide to use should be documented so that data in production can be preprocessed in the same way.
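Here is a small sketch of three of the steps above (imputing, scaling to the 0 to 1 range, and encoding a categorical variable) in plain Python. The rows, feature names, and the `preprocess` helper are invented for illustration; what matters is that the fill value, the min/max, and the category list are all learned from the training rows and then reused on new data.

```python
from statistics import median

# Hypothetical training rows; None marks a missing value.
train_rows = [
    {"age": 22.0, "city": "Lima"},
    {"age": None, "city": "Quito"},
    {"age": 47.0, "city": "Lima"},
    {"age": 60.0, "city": "Bogota"},
]

# Learn every parameter from the training data only.
known_ages = [r["age"] for r in train_rows if r["age"] is not None]
age_fill = median(known_ages)                        # imputation value
age_min, age_max = min(known_ages), max(known_ages)  # scaling range
cities = sorted({r["city"] for r in train_rows})     # categories to encode


def preprocess(row):
    """Impute, min-max scale, and one-hot encode a single row."""
    age = age_fill if row["age"] is None else row["age"]
    age_scaled = (age - age_min) / (age_max - age_min)
    one_hot = [1.0 if row["city"] == c else 0.0 for c in cities]
    return [age_scaled] + one_hot


train_features = [preprocess(r) for r in train_rows]

# A new example in production goes through the same function.
new_features = preprocess({"age": None, "city": "Quito"})
```

Note that a city never seen in training would be encoded as all zeros in this sketch; how to handle such cases is one more decision worth writing down.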
One more important thing to keep in mind:
While developing the model, the team can implement the different functions and procedures through which data is preprocessed. This is referred to as the ‘pipeline’, which is the sequence of steps followed to preprocess the data.
If the development team creates this pipeline, then the Operations team (in charge of launching and maintaining the model) can use that same pipeline to preprocess new data.
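To illustrate the idea, here is a toy pipeline in plain Python. The class names (`MeanImputer`, `Standardizer`, `Pipeline`) are hypothetical and written just for this sketch; libraries such as scikit-learn offer ready-made pipeline objects built around a similar fit/transform pattern. Each step learns its parameters in `fit` and reuses them in `transform`.

```python
from statistics import mean, pstdev


class MeanImputer:
    """Replace missing values (None) with the training mean."""

    def fit(self, column):
        self.fill = mean(v for v in column if v is not None)
        return self

    def transform(self, column):
        return [self.fill if v is None else v for v in column]


class Standardizer:
    """Convert values to z-scores using the training mean and std."""

    def fit(self, column):
        self.mu = mean(column)
        self.sigma = pstdev(column)
        return self

    def transform(self, column):
        return [(v - self.mu) / self.sigma for v in column]


class Pipeline:
    """Run a fixed sequence of steps, in order."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, column):
        # Fit each step on the output of the previous one.
        for step in self.steps:
            column = step.fit(column).transform(column)
        return self

    def transform(self, column):
        for step in self.steps:
            column = step.transform(column)
        return column


# Development team: fit the pipeline on the training data.
pipeline = Pipeline([MeanImputer(), Standardizer()]).fit([1.0, None, 3.0, 5.0])

# Operations team: reuse the fitted pipeline on new data, without refitting.
new_values = pipeline.transform([None, 7.0])
```

The Operations side only ever calls `transform`, so the parameters learned during development are never recomputed from production data.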
This is very valuable information.
Can you please recommend, when you get a chance/time, an article that reviews these preprocesses (creating the pipeline) and its potential effects? For instance, dealing with missing values and how to select a subset etc.
Check out this LINK, which covers the most commonly used preprocessing steps. Also, a Google search on data preprocessing for machine learning will turn up a lot of links that contain very useful information.
The pipeline can be built in Python using libraries like those mentioned in the above article.
You’ll also get to a lesson in the current specialization that discusses several aspects of data preprocessing, so stay tuned.