Model deployment in the real world: data preprocessing question

Hi all,
New here.
I have done some TensorFlow training over the past few years but never had a chance to work on a real project until recently.
Most ML training material seems to stop at building the model and checking the loss value, without actually deploying it and using real data, hence my question.
I am working on a model on water pollution. The input data contains numerical data as well as locations, categories, etc. The data has been scaled and the categories encoded. So far so good. Now if I want to test my model with sample data, do I need to prepare it the same way (i.e., scale it and encode the categories the same way)? I assume so, but this topic is usually not covered in training. How is this done in production applications?
Many thanks

Yes. The model was trained on pre-processed data, so its predictions are only meaningful for inputs that were pre-processed in the same way.

For example, if you normalized the training data, you need to apply the same normalization to the test data. That doesn’t mean you normalize the test data independently: you reuse the statistics (for example, the mean and standard deviation) that you computed from the training set.
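As a rough illustration of that point, here is a minimal sketch in plain Python. The feature name and values are made up for the example, and the helper functions `fit_scaler` and `transform` are hypothetical, not part of any library. The idea is simply that the mean and standard deviation are computed once from the training set and then reused unchanged for any other data.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Compute the mean and standard deviation from the TRAINING values only."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # avoid dividing by zero for a constant feature
    return mu, sigma


def transform(values, mu, sigma):
    """Standardize values using statistics that were fitted elsewhere."""
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values, invented for this example.
train_ph = [6.0, 7.0, 8.0]
test_ph = [7.5, 9.0]

mu, sigma = fit_scaler(train_ph)                 # fit on the training set
train_scaled = transform(train_ph, mu, sigma)
test_scaled = transform(test_ph, mu, sigma)      # reuse the training statistics
```

Libraries such as scikit-learn (`StandardScaler`) follow the same fit-on-train, transform-everywhere pattern, so in practice you would probably not write this by hand.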


Thanks @TMosh,

Ok, so it means that when you deploy your model in prod, you need to have a pre-processing layer which is exactly the same as the one you used in developing your model, correct?
So let’s say I need to build an app that tries to predict the likelihood of a water pollution event whenever a water spillage is reported or detected. Each time I create a new record in my app, I need to normalise its data before sending it to the model…

Correct.
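One common way to do this, sketched below under some assumptions, is to save the preprocessing parameters at training time and load them again in the app, so that every new record goes through exactly the same transformation. The field names (`turbidity`, `source`), the file name, and the helper functions are hypothetical and only there for illustration; this is not a full production setup.

```python
import json
import os
import tempfile
from statistics import mean, pstdev


def fit_preprocessing(records):
    """Learn scaling statistics and the category vocabulary from training records."""
    values = [r["turbidity"] for r in records]
    sigma = pstdev(values)
    return {
        "mean": mean(values),
        "std": sigma if sigma > 0 else 1.0,
        "categories": sorted({r["source"] for r in records}),
    }


def preprocess(record, params):
    """Turn one raw record into the feature vector the model expects."""
    scaled = (record["turbidity"] - params["mean"]) / params["std"]
    # One-hot encode with the training vocabulary; an unseen category
    # becomes all zeros in this sketch.
    one_hot = [1.0 if record["source"] == c else 0.0 for c in params["categories"]]
    return [scaled] + one_hot


# Training time: fit the parameters and save them next to the model.
train_records = [  # hypothetical data
    {"turbidity": 2.0, "source": "farm"},
    {"turbidity": 4.0, "source": "factory"},
    {"turbidity": 6.0, "source": "sewer"},
]
params = fit_preprocessing(train_records)
params_path = os.path.join(tempfile.mkdtemp(), "preprocessing.json")
with open(params_path, "w") as f:
    json.dump(params, f)

# Serving time: load the saved parameters and apply them to each new record.
with open(params_path) as f:
    loaded = json.load(f)
new_record = {"turbidity": 4.0, "source": "factory"}
features = preprocess(new_record, loaded)  # this is what you would pass to the model
```

An alternative worth considering is to put the preprocessing inside the model itself, for example with Keras preprocessing layers such as `tf.keras.layers.Normalization`, so that the saved model accepts raw values and the app cannot drift out of sync with training.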
