Data validation for unstructured data

Hi, in week 1 I learned that monitoring data is important in production ML. However, the examples in the lab and assignment all refer to structured data. What about unstructured data? Is there any tool that can be used to monitor it?

Hi Joyee, thanks for your post.
It is a really interesting question. As far as I know, all the examples and tutorials around TFDV deal with structured data. With unstructured data, you normally cannot infer a schema directly from the data, so you have to manually add information about what is expected to your pipeline.
What you expect depends on the kind of data: it is different for images than for text.
The important thing is that you start from a very general approach, which then needs to be customized for your project.
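To make the idea of "adding expectations manually" a bit more concrete, here is a minimal sketch in plain Python. It is not a TFDV feature; the names `ImageExpectation`, `TextExpectation`, `check_image` and `check_text` are hypothetical, and the specific checks (shape, pixel range, text length) are only assumptions about what a project might care about.

```python
from dataclasses import dataclass


# Hypothetical, hand-written "schema" for images: not inferred from the data.
@dataclass
class ImageExpectation:
    height: int
    width: int
    channels: int
    min_pixel: int = 0
    max_pixel: int = 255


# Hypothetical, hand-written "schema" for text samples.
@dataclass
class TextExpectation:
    min_chars: int
    max_chars: int


def check_image(pixels, expected):
    """Check a nested list indexed as [row][col][channel]; return a list of problems."""
    problems = []
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    channels = len(pixels[0][0]) if width else 0
    shape = (height, width, channels)
    wanted = (expected.height, expected.width, expected.channels)
    if shape != wanted:
        problems.append(f"shape {shape} does not match expected {wanted}")
    # Flatten all channel values to check the value range.
    values = [v for row in pixels for px in row for v in px]
    if values and (min(values) < expected.min_pixel or max(values) > expected.max_pixel):
        problems.append("pixel values out of expected range")
    return problems


def check_text(text, expected):
    """Check the stripped length of a text sample; return a list of problems."""
    problems = []
    n = len(text.strip())
    if n < expected.min_chars:
        problems.append(f"text too short ({n} chars)")
    if n > expected.max_chars:
        problems.append(f"text too long ({n} chars)")
    return problems
```

In a real pipeline you would run checks like these on every incoming batch and track how often they fail over time. The expectations for images and for text are deliberately separate, since they have very little in common.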
In my experience, one tool that you often reuse with unstructured data is the TFRecord file format. If you’re training on GPUs or TPUs, you want your training loop to be as fast as possible and not constrained by file I/O. For that reason, you would normally structure your ingestion pipeline so that the data is packed into TFRecord files.
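As a rough sketch of what packing data into TFRecord files can look like, the snippet below writes (image bytes, label) pairs and reads them back with `tf.data`. The feature names `"image"` and `"label"` and the helper names `make_example`, `write_tfrecord` and `read_tfrecord` are assumptions made up for this example; your own features will depend on your data.

```python
import tensorflow as tf


def make_example(image_bytes, label):
    """Wrap one sample in a tf.train.Example (feature names are illustrative)."""
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))


def write_tfrecord(path, samples):
    """Serialize an iterable of (image_bytes, label) pairs into one TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for image_bytes, label in samples:
            writer.write(make_example(image_bytes, label).SerializeToString())


# The parsing spec must mirror the features written above.
FEATURE_SPEC = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def read_tfrecord(path):
    """Build a tf.data pipeline that parses records and prefetches them."""
    dataset = tf.data.TFRecordDataset(path)
    dataset = dataset.map(
        lambda record: tf.io.parse_single_example(record, FEATURE_SPEC),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    return dataset.prefetch(tf.data.AUTOTUNE)
```

The parallel `map` and the `prefetch` call are what help keep the accelerator from waiting on file I/O. Decoding the image bytes (for example with `tf.io.decode_jpeg`) would be one more `map` step, left out here to keep the sketch short.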