Feature Engineering - there's more to it?

After listening to the course, I still have doubts about whether TensorFlow can, or is even supposed to, handle the entire extent of feature engineering, and whether it is really the “raw data” that is being fed into the transformations. That may be true for unstructured data… but what about structured data? Am I missing something?

Let me try to set up an example (any similarity to the real world is unintended):

A company sells shoes online and wants to improve its recommendation system. It tracks actions while the user is logged in, and to set up labeling it reuses reactions to recommendations from a very simple system it had in place earlier. So we have features, and we have labels of successful reactions to past recommendations. If you squint at this setup, it’s basically the action-reaction classifier that is widely used in finance, sales, you name it…

Now the part that I don’t understand:
Would you really feed all the logs of all the online sessions (the real raw data) into TF, just to compute a ‘clicked_the_green_button_in_last_3_weeks’ feature? To me it seems more efficient to compute the ‘date_last_clicked_the_green_button’ logic outside of TF, which means yet another data pipeline that has to be handled somewhere else(?). The ‘…3_weeks…’ part is the result of a Data Scientist figuring out what makes an important feature in the context of this particular model being developed, and that logic can be embedded in TF. The ‘date_last_clicked…’ part, on the other hand, is a general-purpose feature that can be reused for other models and lives in a separate feature store(?).
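
To make the split concrete, here is a minimal sketch of what I have in mind (the column and feature names are made up for this example, and I’m assuming the upstream pipeline already ships a general-purpose “days since last click” column, so only the model-specific window logic lives in TF):

```python
import tensorflow as tf

def model_specific_features(inputs):
    """Model-specific transformation that could live inside the TF graph."""
    days_since_click = inputs['days_since_last_green_button_click']
    return {
        # The 3-week window is the Data Scientist's choice for this model.
        'clicked_the_green_button_in_last_3_weeks':
            tf.cast(days_since_click <= 21, tf.int64),
    }
```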

TL;DR: do I understand correctly that the entire data transformation logic can be handled in TF? Or am I misunderstanding the term ‘raw data’ as used in the context of this course?

Hi @ArtursValujevs

Thanks for the interesting question.
Well, I would say it depends on what you mean by TF.
One of the areas that has improved greatly since TF’s beginnings is the set of tools we have for handling structured data.
We now have the Feature Columns API, and we can write custom layers to embed many pre-processing steps in the model. There is also the interesting addition of TabNet, designed specifically for tabular data.
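
As a rough illustration (the column names here are just placeholders), both let you keep simple pre-processing inside the model itself:

```python
import tensorflow as tf

# Feature Columns describe how raw structured inputs become model inputs.
price = tf.feature_column.numeric_column('price')
price_buckets = tf.feature_column.bucketized_column(price, boundaries=[20., 50., 100.])

# A custom layer can embed a small pre-processing step directly in the model.
class LogPrice(tf.keras.layers.Layer):
    def call(self, inputs):
        return tf.math.log1p(inputs)
```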
On top of that, there is the whole TFX ecosystem, with TF Transform (which is backed by Apache Beam).
In my view, it is a designer’s choice. You could do everything (collection/statistics/preprocessing) with TFX, using all the tools available inside it… or you could adopt a mixed approach, using other tools for data preparation (for example from the Spark world).
That said, in my view you can do almost everything with TFX.
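
A minimal TF Transform sketch (the feature names are placeholders): the same preprocessing_fn is run at scale by Apache Beam at training time and embedded in the serving graph, which helps avoid training/serving skew:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Full-pass analyzers (mean/vocabulary) plus per-row mappings in one place."""
    return {
        'price_z': tft.scale_to_z_score(inputs['price']),
        'shoe_type_id': tft.compute_and_apply_vocabulary(inputs['shoe_type']),
        'label': inputs['label'],
    }
```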


Thanks!

This gives me more reason to explore TF as a potentially better alternative to what I use now!

I will expand on my perspective in case it is useful to other readers. I’m coming from personal experience, where ‘feature engineering’ has two very distinct paths:

  1. Logical feature engineering. Judging by people’s backgrounds, this is more of a task for a data analyst/engineer with strong domain knowledge, and the end product is transforming data into information that has the potential to be useful (but not necessarily tailored to a specific use case, and not necessarily serving ML solutions directly). It involves combining various data sources, setting up pipelines, defining critical points on timelines, parametrizing calculations relative to time so that no future information leaks in (see the sketch after this list), knowing the quirks and interfaces of data sources, generally being creative about properly expressing the data, understanding the computation/serving limitations of the sources, etc.
  2. Technical feature engineering, where the background is typically a data scientist’s. This part was covered extensively in this course: fine-tuning the given information to fit the context of the problem being solved.
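
For (1), a small sketch of the point-in-time discipline I mean (pandas, with made-up column names): every feature is computed relative to an observation date, so nothing from after the label’s cut-off can leak in.

```python
import pandas as pd

def date_last_clicked_before(clicks: pd.DataFrame, observation_date: pd.Timestamp) -> pd.Series:
    """Per-user date of the last green-button click strictly before the cut-off."""
    past = clicks[clicks['clicked_at'] < observation_date]
    return past.groupby('user_id')['clicked_at'].max()
```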

This course left me somewhat confused by blending together two very distinct things (at least as I perceive them), with setting up the logical pipelines being underrepresented. My concern is that the ‘logical’ part generally delivers more value than the ‘technical’ part, and is WAY more complicated to get right. Fitting a simple model to better data will always beat fitting a better model to simpler data.

BUT these are two completely different toolsets & workflows, and at a certain scale it becomes very appealing to look into ways of getting both onto the same platform. Hence, thanks a lot for the pointers!

Bonus thought: there tends to be ANOTHER round of ‘feature engineering’ when it comes to applying the model’s results, where the end product is an explicit action taken. The probability of an outcome (in the case of a binary classifier) is not what gets fed into the systems consuming it. It’s usually another set of engineered variables (closer to a process representation), plus the probability and varying probability cut-offs in the context of those variables. Basically, an experimentation platform where applying the model can be fine-tuned, versioned, and tracked.
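
A rough illustration of what I mean (the variables and thresholds here are invented): the raw probability is combined with process-level variables and segment-specific cut-offs before an explicit action comes out.

```python
def decide_action(probability: float, stock_level: int, margin: float) -> str:
    """Turn a model score plus process variables into an explicit action."""
    if stock_level == 0:
        return 'no_action'                   # process constraint overrides the model
    cutoff = 0.8 if margin < 0.10 else 0.6   # segment-specific cut-offs, tuned by experimentation
    return 'recommend' if probability >= cutoff else 'no_action'
```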