Data cleaning in TFX

Hi, what would the typical setup be if I need to do some data cleaning? Can the data cleaning process be included in the TFX pipeline? If so, what are the APIs for handling, for example: changing values in a column (e.g. conditional changes based on that column's value or on values from multiple columns), reading a TSV instead of a CSV, imputing values in a column, or changing a column's type (e.g. IDs being read by TFX as integers instead of as categorical or string values)?

Hi! I think all the examples you gave (except for reading TSV) can be written as part of the Transform module. You can write the logic there, before the raw input is output as transformed features. For reading TSV, I don’t think TFX handles it yet. You will have to convert it to CSV or TFRecord before you feed it into CsvExampleGen or ImportExampleGen, respectively.
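
For the TSV-to-CSV conversion, a minimal sketch using only Python's standard `csv` module might look like the following. The function name `tsv_to_csv` is made up for illustration, and the sketch assumes the TSV fields are not quoted and contain no embedded tabs or newlines.

```python
import csv


def tsv_to_csv(tsv_path, csv_path):
    """Rewrite a tab-separated file as a comma-separated file.

    Hypothetical helper, not part of TFX. Returns the number of rows written.
    """
    rows_written = 0
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # Assumption: the TSV uses no quoting, so quote characters are
        # treated as ordinary text when reading.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        # The writer quotes any field that contains a comma or a quote.
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

The resulting file can then be placed in the directory that CsvExampleGen reads from.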

However, you can also do your data cleaning outside TFX. You can set up your Schema to define boundaries and expected types, then use the output of ExampleValidator to guide your data cleaning. Afterwards, you can feed the clean data to your pipeline so it can be transformed into features for training.

Hope this helps.

I had the exact same question. Thanks @chris.favila for the answer, but I have follow-up questions.
You say “you can just feed the clean data to your pipeline so it can be transformed into features for training”, but then what happens to the serving data? We need the same cleaning applied to it too.
A simple example I have in mind is harmonizing country names in a dataset, e.g. converting “U.S.”, “US”, “USA”, and “United States of America” to one common naming convention. Using SchemaGen and ExampleValidator we can detect those and correct them, but again, the same corrections must be applied at serving time.

From your statement “I think all the examples you gave[…] can be written as part of the Transform module”, I understand we can embed that logic in the module file used by Transform. But is this the best practice? And are there other ways in TFX to ensure that the cleaning done to prepare the training data is also applied to the serving data?
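
To make the question concrete, here is a rough sketch of what embedding the cleaning in the Transform module file could look like, written with plain TensorFlow ops. The feature names (`country`, `customer_id`), the alias mapping, and the `"UNKNOWN"` placeholder are all assumptions for illustration. The sketch also assumes dense input tensors; depending on the schema, Transform may pass sparse tensors that need densifying first. I have only exercised this eagerly, not inside a full pipeline.

```python
import tensorflow as tf

# Hypothetical feature names; adjust to the dataset's schema.
COUNTRY_KEY = "country"
ID_KEY = "customer_id"

# Hypothetical mapping from observed spellings to one canonical name.
_COUNTRY_ALIASES = {
    "U.S.": "USA",
    "US": "USA",
    "United States of America": "USA",
}


def _build_alias_table():
    """Builds a string-to-string lookup table; misses return ""."""
    keys = tf.constant(list(_COUNTRY_ALIASES.keys()))
    values = tf.constant(list(_COUNTRY_ALIASES.values()))
    return tf.lookup.StaticHashTable(
        tf.lookup.KeyValueTensorInitializer(keys, values),
        default_value="",
    )


def preprocessing_fn(inputs):
    """Cleans raw features using TensorFlow ops only."""
    country = tf.strings.strip(inputs[COUNTRY_KEY])

    # Harmonize spellings: use the canonical name when an alias is known,
    # otherwise keep the original value.
    canonical = _build_alias_table().lookup(country)
    country = tf.where(tf.equal(canonical, ""), country, canonical)

    # Impute missing (empty) values with a placeholder.
    country = tf.where(tf.equal(country, ""), "UNKNOWN", country)

    return {
        "country_clean": country,
        # Change the column type: treat integer IDs as strings.
        "customer_id_str": tf.strings.as_string(inputs[ID_KEY]),
    }
```

Since everything here is a TensorFlow op, my understanding is that the same logic ends up in the transform graph that Transform exports, so it could be reused on serving requests rather than reimplemented separately.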

Thank you in advance