Hi. In what circumstances should we be performing a full pass in production, and at what scale? For example, do we need to compute a min/max over the past 10 hours of data, or can’t we just use the stats from the training set?
Normally you do a full pass only on the training data, and then you reuse the “transformation” parameters captured during the training phase.
In production you normally process data one batch at a time, and a batch may be fairly small, so a min/max computed on the batch alone, for example, may not be representative.
The good thing about TFX is that it captures all of this information during training and can then apply it at serving time.
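To make the point about reusing training-time parameters concrete, here is a minimal sketch in plain Python. It is not TFX code: the helper names `fit_min_max` and `apply_min_max` and all the numbers are made up for illustration. It only shows how the same serving batch is scaled differently depending on whether you use the stats from the training set or stats computed on the batch itself.

```python
# Illustrative sketch only: helper names and data are hypothetical, not a TFX API.

def fit_min_max(values):
    """Full pass over the data: capture the scaling parameters."""
    return {"min": min(values), "max": max(values)}

def apply_min_max(values, params):
    """Scale values to [0, 1] using previously captured parameters."""
    span = params["max"] - params["min"]
    if span == 0:
        return [0.0 for _ in values]  # degenerate case: constant feature
    return [(v - params["min"]) / span for v in values]

# Training phase: one full pass, parameters are stored with the model.
train = [0.0, 10.0, 20.0, 50.0, 100.0]
params = fit_min_max(train)

# Serving phase: a small batch arrives.
serving_batch = [40.0, 50.0, 60.0]
scaled_with_training_stats = apply_min_max(serving_batch, params)
scaled_with_batch_stats = apply_min_max(serving_batch, fit_min_max(serving_batch))

print(scaled_with_training_stats)  # [0.4, 0.5, 0.6]
print(scaled_with_batch_stats)     # [0.0, 0.5, 1.0]
```

With the training stats, the batch lands where the model expects it (around the middle of the range). With per-batch stats, the same values get stretched to cover 0 to 1, which is not what the model saw during training.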
Ok thanks. So essentially it’s wisest to take the ‘constants’ derived from the training set and apply them to production data.
Basically, the training set is very often much bigger than a serving batch, so the numbers used for normalization, to give an example, are much more likely to be appropriate.
But at serving time we should monitor the data to check that there is no data drift. If drift shows up, you need to investigate, confirm it if that is the case, and then take appropriate action, such as redefining the transform steps and retraining the model.
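As a rough idea of what such monitoring could look like, here is a small sketch in plain Python. Everything in it is an assumption made for illustration: the helper names, the two checks (shift of the batch mean measured in training standard deviations, and the fraction of values outside the training min/max), and the thresholds. A real pipeline would more likely rely on a dedicated tool and on properly chosen statistical tests.

```python
import statistics

# Illustrative sketch only: names, checks, and thresholds are hypothetical.

def fit_reference(values):
    """Capture reference statistics from the training set."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }

def drift_report(batch, ref, max_mean_shift=3.0, max_out_of_range=0.1):
    """Compare a serving batch against the training reference."""
    diff = abs(statistics.mean(batch) - ref["mean"])
    if ref["std"] > 0:
        mean_shift = diff / ref["std"]  # shift in training std units
    else:
        mean_shift = 0.0 if diff == 0 else float("inf")
    outside = sum(1 for v in batch if v < ref["min"] or v > ref["max"])
    out_of_range = outside / len(batch)
    return {
        "mean_shift": mean_shift,
        "out_of_range": out_of_range,
        "drifted": mean_shift > max_mean_shift or out_of_range > max_out_of_range,
    }

ref = fit_reference([10.0, 12.0, 11.0, 9.0, 8.0, 10.0])

normal_report = drift_report([9.0, 10.0, 11.0], ref)
shifted_report = drift_report([20.0, 22.0, 21.0], ref)

print(normal_report["drifted"])   # False
print(shifted_report["drifted"])  # True
```

A flagged batch is only a signal to investigate. Whether it is real drift, and whether the transform steps and the model need to be redone, is still a decision for a person.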
Today this kind of well-organized, monitored pipeline is often not implemented… this is why all the discussions we’re having in this specialization are so important… there is a need for awareness around these subjects… and then it takes time and effort to implement these approaches.