Merging Beam pipeline into TFX

I’m working on an Apache Beam pipeline that converts images to TFRecords. So far it just converts the images into the format required by the TF Object Detection API.

Now I’m wondering: how do I incorporate my existing pipeline into a TFX pipeline? I’m aware that TFX uses Beam under the hood, but can I just lift my pipeline straight into preprocessing_fn?
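For context, this is roughly the kind of conversion step the question is about (all names here are hypothetical, not taken from the actual pipeline): a plain function that packs one image file into a serialized tf.train.Example, which a Beam pipeline could then apply with beam.Map.

```python
import tensorflow as tf


def _bytes_feature(value: bytes) -> tf.train.Feature:
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _int64_feature(value: int) -> tf.train.Feature:
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def image_to_tf_example(image_path: str, label: int) -> bytes:
    """Read one image file and pack it into a serialized tf.train.Example."""
    raw = tf.io.read_file(image_path).numpy()
    height, width, _ = tf.io.decode_jpeg(raw).shape
    example = tf.train.Example(features=tf.train.Features(feature={
        "image_raw": _bytes_feature(raw),
        "height": _int64_feature(height),
        "width": _int64_feature(width),
        "label": _int64_feature(label),
    }))
    return example.SerializeToString()


# Inside a Beam pipeline this could be wired up as, e.g.:
#   paths | beam.Map(lambda p: image_to_tf_example(p, label_for(p)))
#         | beam.io.WriteToTFRecord(output_prefix)
```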

Hi all @MLEP_Mentors

I’m bumping this question.

A bit more background:

I want to build a TFX pipeline for object detection. The idea is to fine-tune a version of EfficientDet on the Neamura Bird dataset.

So I’ve created an Apache Beam pipeline that converts the dataset to the format required by the TensorFlow Object Detection API.

But as we’ve learned in this course, we want all conversion of the raw input data included in the TFX pipeline, to avoid data skew and the like.

So how would I go about including the Beam pipeline in the TFX pipeline?
Should I just copy-paste the relevant ParDos into TFX?

Hey @smedegaard, please refrain from tagging entire groups in posts, especially staff and mentor groups.

Oh sorry. I thought that was what the mentors were for: answering students’ questions.

That is correct. Mentors are for answering student questions. But also, please understand:

Firstly, you are asking MLEP-specialisation questions of Deep Learning Specialisation mentors. They won’t be able to help you out, as they are not familiar with its content.

Secondly, by tagging entire groups, yes, some will answer you, but others will start getting unwanted notifications.

Thirdly, our mentors are very dedicated when it comes to helping learners. You don’t have to tag them to get their attention. They all get a number of queries every day, so sometimes they might not be able to answer immediately, but they do answer when they find time.


Hey @chris.favila, could you please answer @smedegaard’s query? Thanks!

I only saw that tag for mentors. The question had gone unanswered for six days, so I tried to bring it to the attention of the relevant people.

How should the students reach out to the MLOps mentors, for future reference?

Sorry to all that got disturbed by this.

You can visit this page on Coursera to see who the mentors for this course are. You can ask them to help you out, but refrain from direct-messaging them unless they ask you to.

I agree that your post went unanswered for six days. We have given the mentors a nudge to answer the pending queries, so hopefully one of them or Chris will answer you soon.

I’d have loved to help you out, but I’m also unfamiliar with this content.


Hi @smedegaard ! Maybe we can resolve this starting from ExampleGen / start of your pipeline. What would be your input there? Do you have a CSV that contains the image paths, bounding boxes, etc… Or do you have TFRecords already that contain the images? Maybe you can upload some code so we can see the problem more clearly.


Hi @chris.favila, thanks for reaching out.

I’ve downloaded the dataset as JPEGs and the annotation files as TXT, and then converted them to TFRecords with this Beam pipeline: Anders Smedegaard Pedersen / birds_to_tfrecords · GitLab

The TFX pipeline is a mess at the moment; I just wanted to get a minimum viable pipeline working. It’s located here: Anders Smedegaard Pedersen / birds_ml_pipeline · GitLab

There are a number of things confusing me, but to stay on the topic of image preprocessing and the initial model:

Here’s what I’ve landed on so far:

  • The TFX pipeline cannot accept both images and TFRecords as input.
  • I should create a TFX component that converts images to tf.Examples, like in Building Machine Learning Pipelines (p. 310).
    • Luckily, I can reuse a lot of the code from my Beam pipeline.

An alternative might be to feed all raw images through my Beam pipeline first, and then feed the resulting TFRecords to TFX. But this would create problems if I wanted to deploy the model on an edge device, I guess?

I would be very happy if you can confirm or deny these assumptions.


Hi Anders! Thank you for the info! Please give me some time to go through them. We are currently busy with some tasks on the backend but rest assured that I have this thread bookmarked for review. At first glance though, I think you’re right and the best approach is to create a custom component so you can make the transformations work properly even on edge devices. But let’s see. Will get back to you.



Hi @chris.favila
just going to bump this. Hope everything is well!


Hi Anders! Sorry if it’s taking a while. We still have our hands full but I think the load would be lighter on Monday and I can look at your problem more closely. I’ve been practicing TFX lately too and hopefully, some of the concepts can also be applied here. Again, sorry for the delay and will catch up next week!

Hi Anders! Just letting you know I didn’t forget about this. Was supposed to look at it earlier but power went down for quite a while. I’ll give it a shot tomorrow. Thanks.

Hi Anders!

I think for your application, the main thing you’ll need outside of the standard TFX components is one that loads the raw images from disk (i.e. what your get_image_content() function does). You can pack those together with the image dimensions and labels into tf.train.Example protos, then serialize them to TFRecords before feeding them to ExampleGen. I think that’s what you’re already doing.

Maybe you can simplify that initial step and just put the raw image bytes and annotations in those TFRecords. All the reshaping, padding, datatype changes, etc. can be handled in the Transform component (just like the Course 2 W4 ungraded lab with CIFAR10). That way, the transformation graph can be exported with the model, and you will just need the raw input during inference. If you’re enrolled in Course 3, you might pick up some tips on the Transform and Trainer components from the TFX-related labs in W1 and W4.
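The Transform step described above could look roughly like this; a sketch only, with assumed feature names ("image_raw", "label") and an assumed target size, following the shape of the CIFAR10 lab. preprocessing_fn operates on batched tensors parsed from the tf.Examples, and tf.Transform exports these ops with the model graph:

```python
import tensorflow as tf

IMAGE_SIZE = 224  # assumed input size for the detection model


def preprocessing_fn(inputs):
    """tf.Transform preprocessing: decode raw JPEG bytes and resize.

    `inputs` maps feature names to batched tensors; bytes features
    typically arrive with shape [batch, 1], hence the raw[0] indexing.
    """
    def decode_and_resize(raw):
        image = tf.io.decode_jpeg(raw[0], channels=3)
        image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
        return tf.cast(image, tf.float32) / 255.0  # scale to [0, 1]

    images = tf.map_fn(
        decode_and_resize,
        inputs["image_raw"],
        fn_output_signature=tf.float32,
    )
    return {
        "image": images,
        "label": inputs["label"],
    }
```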

Hope this helps!

Thanks Chris. It does help!
