What would be the lifecycle of unstructured data?

In the course, a typical data life cycle is described, which makes total sense for structured data.

But I was wondering how this transfers to unstructured data such as images. I believe the need for tracking is equally important.

More specifically, what would the statistics and schema be, for example, in this case? Also, how can we define anomalies and data drift? Taking the phone-scratch example that Andrew presented in Course 1: if the light exposure diminishes over time, are there ways to detect this shift in the dataset?

Or maybe we need a totally different processing strategy, with other TFX modules?

Many thanks

One could calculate the mean brightness of images with OpenCV or NumPy and add it as an attribute to the TFExamples.
That way it can be tracked over time.
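As a minimal sketch of the idea (the function name and the synthetic image are just for illustration), the mean brightness can be computed directly with NumPy:

```python
import numpy as np

def mean_brightness(image):
    """Mean pixel intensity of an (H, W) or (H, W, C) array with values in 0-255."""
    return float(np.mean(image))

# Synthetic grayscale image, all pixels at intensity 128
img = np.full((64, 64), 128, dtype=np.uint8)
print(mean_brightness(img))  # 128.0
```

For color images you might first convert to grayscale (e.g. with OpenCV) so the statistic is comparable across channels.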


Yes, basically this could be a good example.

From a conceptual point of view, it is the same. You need to organize your pipeline so that it is repeatable, and you need to ensure that the same transformations are applied at training and serving time (with images, for example, you often normalize the pixel intensity).
In addition, it is important to ensure that over time there is no data drift. Mean brightness, as mentioned, is one of the metrics you need to keep under control; and if you resize the images, for example, you need to ensure that the transforms are still valid.
The tools you have seen with TFX are mostly for structured data, but there are many good tools for images as well. For example, staying in the TF/Keras space, in TF2 many transformations can be inserted as part of the model itself.
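To illustrate what "transformations as part of the model" can look like in TF2, here is a minimal sketch (the input size and the tiny head are arbitrary choices for the example): the Rescaling layer normalizes pixel intensities inside the model, so the identical transform runs at training and serving time.

```python
import tensorflow as tf

# Pixel normalization baked into the model itself, so there is
# no risk of training/serving skew for this transform.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Rescaling(1.0 / 255),   # maps [0, 255] -> [0, 1]
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```

Data-augmentation layers (RandomFlip, RandomRotation, etc.) can be embedded the same way and are active only during training.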


Thanks for the replies smedegaard and luigisaetta.
Yes, I totally see the need to track the transformations applied to the images, but what wasn’t clear to me is the part about statistics and schemas for detecting data skew.

So if I understand correctly, the best way is to hand-craft some additional features (such as a mean brightness value) and use them to compute statistics. Am I correct?

I’m fairly new to the subject myself, but I think that’s the idea.
The features of image data need to be extracted/engineered much like the features of structured data.

If you are using TensorFlow, the images should be converted to the TFExample format and stored in TFRecords.

This is the structure of the TFExample, from an Apache Beam pipeline I wrote recently:

features = tf.train.Features(
    feature={
        "image/encoded": bytes_feature(image_raw),
        "image/object/class/label": int64_list_feature(label_list),
        "image/object/class/text": bytes_list_feature(class_text_list),
        "image/object/bbox/xmin": float_list_feature(xmin_list),
        "image/object/bbox/xmax": float_list_feature(xmax_list),
        "image/object/bbox/ymin": float_list_feature(ymin_list),
        "image/object/bbox/ymax": float_list_feature(ymax_list),
        "image/filename": bytes_feature(image_name.encode()),
        "image/hash/sha256": bytes_feature(value=sha256_hash),
        "image/height": int64_feature(height),
        "image/width": int64_feature(width),
        "image/num_channels": int64_feature(channels),
    }
)

example = tf.train.Example(features=features)

This basically becomes the schema, and it can be extended to include mean brightness as a float or whatever makes sense. If the brightness then changes a lot, it can be picked up by the validator.
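A small self-contained sketch of that extension (the `float_feature` helper is hypothetical here, written in the same style as the `*_feature` helpers above, and the synthetic image is just for the example):

```python
import numpy as np
import tensorflow as tf

# Hypothetical scalar-float helper, mirroring the other *_feature helpers.
def float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

image = np.full((32, 32), 100, dtype=np.uint8)

features = tf.train.Features(feature={
    "image/mean_brightness": float_feature(float(np.mean(image))),
})
example = tf.train.Example(features=features)

# Round-trip: read the value back out of the proto
stored = example.features.feature["image/mean_brightness"].float_list.value[0]
print(stored)  # 100.0
```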

If @luigisaetta or any other mentor can confirm this I would be thankful. As I said, I’m also new to this :grinning:


Hi @smedegaard

yes, that could be a possible and good approach.

Normally, we talk about a schema when we’re dealing with structured data.
For unstructured data, the concept of a schema is less well defined, but the data scientist, together with an SME, can still figure out the expected characteristics of the data they’re working on. Referring again to Andrew Ng’s example: we expect that the mean brightness is not lower than MB… and we want to keep this under control throughout the ML pipeline.
So your approach could be the right one; the actual things you want to keep under control must be defined when you examine the context.
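As a toy illustration of keeping such a characteristic under control (the function, threshold, and sample values are all invented for the example), a very simple drift check on per-image brightness values could look like:

```python
import numpy as np

def brightness_drifted(baseline, current, max_shift=10.0):
    """Flag drift when the mean of per-image brightness values moves
    more than `max_shift` away from the baseline mean."""
    return abs(float(np.mean(current)) - float(np.mean(baseline))) > max_shift

baseline = np.array([120.0, 125.0, 118.0])   # brightness of reference images
darker = np.array([90.0, 95.0, 88.0])        # a later, dimmer batch
print(brightness_drifted(baseline, darker))  # True
```

In practice you would compare full distributions (e.g. with a statistical distance) rather than just the means, which is what TFDV's skew/drift comparators do for you.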


Thanks for the answer Luigi!

By schema I mean a TensorFlow Data Validation schema. This can be inferred from the TFExamples, if I understand correctly.

Hi @smedegaard
I see.
I was talking in a more general way: as you know, the difference between structured and unstructured data is that the concept of a schema for the latter is less well defined.
In the Big Data world, in fact, we talk about “schema on read”: the schema is not defined when you write the data to a data lake, but when you read it.
I think that is more or less what you’re proposing with your approach.


Hi, I was looking for more on this and saw this. It might help.