Feature Engineering - there's more to it?

After listening to the course, I still have doubts about whether TensorFlow can, or is even supposed to, handle the entire extent of feature engineering, and whether it is really the “raw data” that is being fed into the transformations. That may be true for unstructured data… but what about structured data? Am I missing something?

Let me try to set up an example (any similarity to the real world is unintended):

A company sells shoes online and wants to improve its recommendation system. It tracks actions while the user is logged in, and to set up labeling it reuses reactions to recommendations from a very simple system it had in place earlier. So we have features, and we have labels of successful reactions to past recommendations. If you squint at this setup, it’s basically the action-reaction classifier that is widely used in finance, sales, you name it…

Now the part that I don’t understand:
Would you really feed all the logs of all the online sessions (the real raw data) into TF, just to compute a ‘clicked_the_green_button_in_last_3_weeks’ feature? To me it seems more efficient to compute the ‘date_last_clicked_the_green_button’ logic outside of TF, which means yet another data pipeline that has to be handled somewhere else(?). The ‘…3_weeks…’ part is the result of a Data Scientist figuring out what makes an important feature in the context of this particular model being developed, and that logic can be embedded in TF. The ‘date_last_clicked…’ part, on the other hand, is a general-purpose feature that can be reused for other models and lives in a separate feature store(?).
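
To make the split concrete, here is a minimal sketch of what I have in mind (the column and feature names are made up for this example, and I’m assuming the upstream pipeline already ships a general-purpose “days since last click” column, so only the model-specific window logic lives in TF):

```python
import tensorflow as tf

def model_specific_features(inputs):
    """Model-specific transformation that could live inside the TF graph."""
    days_since_click = inputs['days_since_last_green_button_click']
    return {
        # The 3-week window is the Data Scientist's choice for this model.
        'clicked_the_green_button_in_last_3_weeks':
            tf.cast(days_since_click <= 21, tf.int64),
    }
```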

TL;DR: do I understand correctly that the entire data transformation logic can be handled in TF? Or am I misunderstanding the term ‘raw data’ as used in the context of this course?

Hi @ArtursValujevs

Thanks for the interesting question.
Well, I would say it depends on what you mean by TF.
One of the areas that has improved greatly since TF’s beginnings is the set of tools we have for handling structured data.
We now have the Feature Columns API, and we can write custom layers to embed many pre-processing steps in the model. There is also the interesting addition of TabNet, designed specifically for tabular data.
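
As a rough illustration (the column names here are just placeholders), both let you keep simple pre-processing inside the model itself:

```python
import tensorflow as tf

# Feature Columns describe how raw structured inputs become model inputs.
price = tf.feature_column.numeric_column('price')
price_buckets = tf.feature_column.bucketized_column(price, boundaries=[20., 50., 100.])

# A custom layer can embed a small pre-processing step directly in the model.
class LogPrice(tf.keras.layers.Layer):
    def call(self, inputs):
        return tf.math.log1p(inputs)
```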
On top of that, there is the whole TFX ecosystem, with TF Transform (which is backed by Apache Beam).
In my view, it is a designer’s choice. You could do everything (collection/statistics/preprocessing) with TFX, using all the tools available inside it… or you could adopt a mixed approach, using other tools for data preparation (for example from the Spark world).
That said, in my view you can do almost everything with TFX.
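
A minimal TF Transform sketch (the feature names are placeholders): the same preprocessing_fn is run at scale by Apache Beam at training time and embedded in the serving graph, which helps avoid training/serving skew:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Full-pass analyzers (mean/vocabulary) plus per-row mappings in one place."""
    return {
        'price_z': tft.scale_to_z_score(inputs['price']),
        'shoe_type_id': tft.compute_and_apply_vocabulary(inputs['shoe_type']),
        'label': inputs['label'],
    }
```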


Thanks!

This gives me more reason to explore TF as a potentially better alternative to what I use now!

I will expand on my perspective in case it is useful to other readers. I’m coming from personal experience, where ‘feature engineering’ has two very distinct paths:

  1. Logical feature engineering. Judging by people’s backgrounds, this is more of a task for a data analyst/engineer with strong domain knowledge, and the end product is transforming data into information that has the potential to be useful (but not necessarily tailored to a specific use case, and not necessarily serving ML solutions directly). It involves combining various data sources, setting up pipelines, defining critical points on timelines, parametrizing calculations relative to time so that no future information leaks in (see the sketch after this list), knowing the quirks and interfaces of data sources, generally being creative about properly expressing the data, understanding the computation/serving limitations of the sources, etc.
  2. Technical feature engineering, where the background is typically a data scientist’s. This part was covered extensively in this course: fine-tuning the given information to fit the context of the problem being solved.
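
For (1), a small sketch of the point-in-time discipline I mean (pandas, with made-up column names): every feature is computed relative to an observation date, so nothing from after the label’s cut-off can leak in.

```python
import pandas as pd

def date_last_clicked_before(clicks: pd.DataFrame, observation_date: pd.Timestamp) -> pd.Series:
    """Per-user date of the last green-button click strictly before the cut-off."""
    past = clicks[clicks['clicked_at'] < observation_date]
    return past.groupby('user_id')['clicked_at'].max()
```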

This course left me somewhat confused by blending together two very distinct things (at least as I perceive them), with setting up the logical pipelines being underrepresented. My concern is that the ‘logical’ part generally delivers more value than the ‘technical’ part, and is WAY more complicated to get right. Fitting a simple model to better data will always beat fitting a better model to simpler data.

BUT these are two completely different toolsets & workflows, and at a certain scale it becomes very appealing to look into ways of getting both onto the same platform. Hence, thanks a lot for the pointers!

Bonus thought: there tends to be ANOTHER round of ‘feature engineering’ when it comes to applying the model’s results, where the end product is an explicit action taken. The probability of an outcome (in the case of a binary classifier) is not what gets fed into the systems consuming it. It’s usually another set of engineered variables (closer to a process representation), plus the probability and varying probability cut-offs in the context of those variables. Basically, an experimentation platform where applying the model can be fine-tuned, versioned, and tracked.
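
A rough illustration of what I mean (the variables and thresholds here are invented): the raw probability is combined with process-level variables and segment-specific cut-offs before an explicit action comes out.

```python
def decide_action(probability: float, stock_level: int, margin: float) -> str:
    """Turn a model score plus process variables into an explicit action."""
    if stock_level == 0:
        return 'no_action'                   # process constraint overrides the model
    cutoff = 0.8 if margin < 0.10 else 0.6   # segment-specific cut-offs, tuned by experimentation
    return 'recommend' if probability >= cutoff else 'no_action'
```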