You are right: the pre-processing you apply to your train/test data also has to be applied to your production data.
If you, for instance, impute missing values with mean() on training/test, you need the same imputation in production.
If you normalize with min-max on train/test, you need to do the same in production at inference time.
If you one-hot encode categorical features, you need to use the same technique in production.
And so on… (see the sketch after this list for one way to persist these fitted transforms).
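As a minimal sketch of the idea with scikit-learn (the column names, X_train, and X_prod are placeholders, not from the original question): fit the transforms once on training data, save them, and reload the same fitted object in production.

import joblib
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_cols = ["age", "income"]    # hypothetical columns
categorical_cols = ["country"]      # hypothetical column

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),  # mean imputation
        ("scale", MinMaxScaler()),                   # min-max normalization
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

preprocess.fit(X_train)                    # learn statistics from training data only
joblib.dump(preprocess, "preprocess.joblib")

# In production: load the fitted transformer and apply it unchanged
preprocess = joblib.load("preprocess.joblib")
X_prod_transformed = preprocess.transform(X_prod)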
Now, how do you do this pre-processing in production? It depends on the framework you use. TensorFlow and PyTorch both offer ways to pre-process the data when you package your model for production. To give you a quick example, the following method builds a model for inference:
import tensorflow as tf

def get_inference_model(model):
    # maxLen is the sequence length used at training time (defined elsewhere)
    inputs = tf.keras.Input((maxLen, 1629), dtype=tf.float32, name="inputs")
    # Rebuild the graph from the trained model's layers, skipping its input layer
    for i in range(1, len(model.layers)):
        if i == 1:
            x = model.layers[i](inputs)
        else:
            x = model.layers[i](x)
    # Apply the same mean() normalization used on the training/test data
    x = tf.reduce_mean(x, axis=0)
    output = tf.keras.layers.Activation(activation="linear", name="outputs")(x)
    inference_model = tf.keras.Model(inputs=inputs, outputs=output)
    inference_model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return inference_model
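For completeness, a hypothetical usage (trained_model and the export path are placeholders, and saving a directory path assumes the TF 2.x SavedModel behavior):

# Wrap the trained model and export it for serving
inference_model = get_inference_model(trained_model)
inference_model.save("inference_model")  # SavedModel directory, loadable at serving time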
In this case, I normalized with mean() on the training/test data. When creating the model for inference, I added a line that applies the same pre-processing to the production data:
x = tf.reduce_mean(x, axis=0)
Real pipelines can get much more complex than this, and each framework gives you options to handle it.
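For example, in TensorFlow/Keras you can bake the pre-processing into the model itself with a preprocessing layer, so serving applies exactly the same transform as training. A minimal sketch (train_features, num_features, and num_classes are placeholders, not from my code above):

import tensorflow as tf

# Learn mean/variance from the training data once
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(train_features)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(num_features,)),
    norm,  # identical normalization at training and inference time
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

Because the normalization layer is part of the saved model, there is no separate pre-processing step to keep in sync in production.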
Hope this helps.
Juan