You are right: the pre-processing that you apply to your train/test dataset must also be applied to your production data.

If, for instance, you impute missing values using mean() in training/test, you need the same imputation, with the same mean, in production.
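As a minimal sketch of that discipline, assuming scikit-learn and toy placeholder data (SimpleImputer, X_train, and X_prod are my illustration, not part of the model below), the key point is that the statistic is fitted on the training data and only reused, never refitted, in production:

import numpy as np
from sklearn.impute import SimpleImputer

# Toy stand-ins for your real data (placeholder values)
X_train = np.array([[1.0], [3.0], [np.nan]])
X_prod = np.array([[np.nan], [5.0]])

# Fit the imputer on the training data only
imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)

# In production, reuse the training mean (2.0 here); never refit on production data
X_prod_imputed = imputer.transform(X_prod)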

If you normalize with min-max in train/test, you need to do the same in production for inference.
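Same idea for min-max; a sketch, again assuming scikit-learn and placeholder arrays:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for your real data (placeholder values)
X_train = np.array([[0.0], [10.0]])
X_prod = np.array([[5.0], [12.0]])

# Learn min and max from the training data only
scaler = MinMaxScaler()
scaler.fit(X_train)

# Production data is scaled with the training min/max,
# even if it falls outside the range seen in training
X_prod_scaled = scaler.transform(X_prod)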

If you use one-hot encoding for categorical features, you need to use the same technique in production.
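A sketch of the same rule for one-hot encoding, again assuming scikit-learn; handle_unknown="ignore" is my choice here, to keep unseen production categories from raising errors:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for a single categorical column (placeholder values)
X_train = np.array([["red"], ["green"], ["blue"]])
X_prod = np.array([["green"], ["purple"]])  # "purple" was never seen in training

# Learn the category set (and column order) from the training data
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# Unseen categories become all-zero rows instead of raising an error
X_prod_encoded = encoder.transform(X_prod)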

And so on…

Now, how do you do this pre-processing in production? It depends on the framework you use. TensorFlow and PyTorch offer options to pre-process the data when you package your model for production. To give you a quick example, the following function builds a model for inference:

def get_inference_model(model):
    # maxLen and the 1629 feature size come from the training setup
    inputs = tf.keras.Input((maxLen, 1629), dtype=tf.float32, name="inputs")

    # Rebuild the trained model layer by layer on top of the new input
    # (layer 0 is the original Input layer, so we start at 1)
    for i in range(1, len(model.layers)):
        if i == 1:
            x = model.layers[i](inputs)
        else:
            x = model.layers[i](x)

    # Average over axis 0: the same mean() step applied during training/test
    x = tf.reduce_mean(x, axis=0)

    output = tf.keras.layers.Activation(activation="linear", name="outputs")(x)

    inference_model = tf.keras.Model(inputs=inputs, outputs=output)
    inference_model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return inference_model

In this case, I had done some normalization with mean() on the training/test data. So, when creating the model for inference, I added a line that applies the same pre-processing to the production data:

x = tf.reduce_mean(x, axis=0)

Your pre-processing can get much more complex than this, and each framework gives you options to embed it in the model you ship.
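For example (an addition on my part, assuming a recent TensorFlow version and placeholder data), Keras ships dedicated pre-processing layers that you adapt on the training data and save inside the model itself, so production inference automatically gets the same transformation:

import numpy as np
import tensorflow as tf

# Toy stand-in for your real training features (placeholder values)
X_train = np.random.rand(100, 4).astype("float32")

# The Normalization layer learns mean/variance from the training data...
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(X_train)

# ...and because it lives inside the model, saving the model also
# ships the pre-processing to production
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    norm,
    tf.keras.layers.Dense(1),
])
model.save("model_with_preprocessing.keras")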

Hope this helps.

Juan