C2W4 Feature Engineering code for Weather data and Accelerometer data

I noticed that both labs introduce plenty of packages, such as Apache Beam and tfx_bsl. Why didn’t I see them in the assignments from previous weeks?

Also, unlike previous labs, there aren’t many hints such as links to documentation, which makes it difficult to fully understand the concrete logic: for example, what the output of one function is, and how that output feeds into the next function as input and affects the result.

BTW, what’s the relationship between Beam and TFX? Both are said to be for data processing pipelines. Since I’m new to both, it’s quite challenging and I get a little lost when they both appear in the code.

Hello @Feihong_YANG
Labs commonly introduce new packages or libraries based on the specific concepts or topics being covered. There are more hints and links like the ones you’re suggesting, but you will find them in the “Reading” or “Resources” section, which is usually right next to the lab.

Regarding the relationship between Apache Beam and TFX, both are frameworks for building data processing pipelines, but they serve different purposes. Apache Beam is a unified programming model and set of libraries for building batch and streaming data processing pipelines that are portable across different execution engines while TFX (TensorFlow Extended) is an end-to-end platform built on top of TensorFlow for building and deploying production-ready machine learning pipelines.

Hope this helps.

Hey @Isaak_Kamau, per your comment:

“Regarding the relationship between Apache Beam and TFX, both are frameworks for building data processing pipelines, but they serve different purposes. Apache Beam is a unified programming model and set of libraries for building batch and streaming data processing pipelines that are portable across different execution engines while TFX (TensorFlow Extended) is an end-to-end platform built on top of TensorFlow for building and deploying production-ready machine learning pipelines.”

I can see from the course material how TFX is applied by calling TFDV, TFT, etc. to build the ML pipeline. But I’m still confused about the benefit of Apache Beam. I went over some documentation for TF Transform and Apache Beam but still didn’t see the concrete value it brings. Perhaps it unifies the programming code: the official doc describes it as “A simplified, single programming model for both batch and streaming use cases for every member of your data and application teams.” I take that to mean it’s portable, so whichever platform you run on, you can still call the Beam API, like how it replaced the Spark code in this doc when running on the Spark runner. The doc also says the API is built on the Beam Model; perhaps the paper it refers to defined a standard for ETL/data processing pipelines that meets most industry requirements.

Is the only value it brings that you don’t need to change much source code when migrating your project from one platform (like Spark) to another (like Google Cloud Dataflow)? BTW, I did see that it also unifies the data type in the TFT get_started guide, since a PCollection can be manipulated on whichever platform through the Beam API. But is there a concrete case showing that some goal cannot be achieved if we use TFX standalone?

@Feihong_YANG
I like your insightful observations on the matter!
While TFX provides many valuable components for building machine learning pipelines, I think Apache Beam stands out in situations such as streaming or real-time data processing. With Beam you can analyze and process incoming data in real time, which lets you gain insights and act on them immediately. TFX primarily focuses on batch data processing for training and serving machine learning models. While it supports some streaming capabilities, it may not offer the same level of flexibility and scalability as Apache Beam for real-time streaming data. If your use case involves large-scale streaming data processing, Apache Beam provides a more comprehensive set of features and optimizations designed specifically for streaming pipelines. I think that is one of the situations where you would want to work with Apache Beam, on top of its cross-platform compatibility and ecosystem integration. Hope this helps.

@Isaak_Kamau Thanks for the detailed explanation! I see. If that’s the case, then I can understand why its author named it … Beam… :laughing:
BTW, I think this is an important topic. The motivation for taking the MLOps course is to use machine learning technology to design and build digital products and services. Understanding the advantages and disadvantages of different platforms for different business use cases is critical during the design phase and can have a great impact on the product’s success.
