Balanced feature engineering pipeline for multiple tasks/sets of labels

The first lessons say that the feature engineering pipeline splits the data into train/val/test sets and stratifies the split by balancing on the labels.
But one of the characteristics of the feature store is that the same features can be re-used across different projects. Assuming each project has a different task and a different set of labels, aren’t the two things contradictory?

Hi @joaomario-brainly, welcome to our community!

The reason for balancing by labels is to address class imbalance in the dataset while also guaranteeing that the dev and test sets have the same distribution. This data transformation step mitigates data mismatch issues. As a rule of thumb, always make sure that dev and test come from the same distribution.
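As a rough illustration of the stratified part of this, here is a minimal sketch in plain Python. The function name `stratified_split`, the 80/10/10 fractions and the toy labels are assumptions made for the example, not the course's actual pipeline. It shuffles each class separately so that every split keeps the same class proportions:

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign 'train'/'val'/'test' to each record, class by class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    assignment = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within one class only
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        for pos, idx in enumerate(indices):
            if pos < n_train:
                assignment[idx] = "train"
            elif pos < n_train + n_val:
                assignment[idx] = "val"
            else:
                assignment[idx] = "test"  # remainder goes to test
    return assignment

# Toy, imbalanced label column: 80% "pos", 20% "neg".
labels = ["pos"] * 80 + ["neg"] * 20
splits = stratified_split(labels)

for name in ("train", "val", "test"):
    counts = Counter(l for l, s in zip(labels, splits) if s == name)
    print(name, dict(counts))
```

Note that the split is computed from the label column, which is why it is tied to one particular task.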

Balancing the classes is a process that is agnostic to the specific problem or hypothesis class. If you use the same dataset and features for other tasks, the feature store is very handy because the whole process is already stored.

I hope my explanation helps!

Hi Raul, the benefits of stratified sampling (label balancing) when producing train/val/test splits are clear to me.
What I am not sure about is the abstraction layer generated by the feature store in order to re-use the same features across multiple projects. If the feature engineering process includes the transformations on the train/val splits, it depends on the labels, and therefore on the task. How, then, can this process be generalized to different projects?

Thanks for following up. The features generalize to other problems that take the same set of features and labels as training input. The label is itself a feature: the one you want to predict. So in this case, the same labels are used for different hypotheses in the same class.

Using different labels (that is, labelling the same data differently) makes the original dataset different, because in supervised learning the labels are part of the dataset as a feature. An analogy that helps me when thinking about classification is the regression case: if you change the label (a continuous variable), you get a whole different feature, and therefore a different dataset.
So my understanding is that you’d have to go through the feature engineering process again, starting from the raw data.

I forgot to highlight the value of the feature store as a place where you securely save features and serve them from. This video explains this part:

In my opinion, the main benefit is having the features you created (in our exercise, the BERT embeddings) stored on a sort of “shelf” that you can query, instead of having to run the cells all over again every time you restart the kernel.


Thank you Raul.

I conclude from our discussion that we should also partition our feature groups by task, with one feature engineering pipeline (including the train/val/test split, if required) per task.
Some features are more generalizable, such as the original BERT embeddings: since they have not been fine-tuned on any specific task, they can be saved and re-used for several classification tasks on the same text data.
Suppose my features are a bit more advanced than that, e.g. we apply a pre-processing function to the text (such as removing common words) and this pre-processing needs to be learnt in a supervised way. Then we would have one feature engineering pipeline per task, since the training data (and labels) differ for each task, and we should create a different feature group each time.
In the latter scenario, the feature store does not provide a re-usability layer between tasks: the features are tied to a specific set of training data and splits, and the downstream models should be trained on the same splits, otherwise we invalidate the methodology. Where would we store the information about which records belong to train/val/test? Should that be an additional metadata feature group?
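To make the question concrete, here is one possible layout, sketched in plain Python with dictionaries standing in for feature groups. All names (`shared_features`, `task_tables`, `build_dataset`) and values are hypothetical and do not correspond to any feature store API. The task-agnostic features live in one table keyed by record ID, and each task gets its own small table holding the label and the split assignment:

```python
# Shared, task-agnostic features keyed by record ID (made-up values,
# standing in for something like un-fine-tuned embeddings).
shared_features = {
    "r1": [0.1, 0.3],
    "r2": [0.7, 0.2],
    "r3": [0.4, 0.9],
}

# One task-specific table per task: label and split assignment per record.
# This plays the role of the "metadata feature group" in the question.
task_tables = {
    "sentiment": {
        "r1": {"label": 1, "split": "train"},
        "r2": {"label": 0, "split": "val"},
        "r3": {"label": 1, "split": "train"},
    },
    "topic": {
        "r1": {"label": "sports", "split": "test"},
        "r2": {"label": "news", "split": "train"},
        "r3": {"label": "news", "split": "train"},
    },
}

def build_dataset(task, split):
    """Join the shared features with one task's labels for one split."""
    rows = []
    for record_id, meta in task_tables[task].items():
        if meta["split"] == split:
            rows.append((record_id, shared_features[record_id], meta["label"]))
    return rows

print(build_dataset("sentiment", "train"))
print(build_dataset("topic", "train"))
```

With this layout, every downstream model for a given task reads the same split assignment, while the shared features are computed only once. Features that are learnt in a supervised way would still need their own per-task table.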



Yes, you assumed it right. You will see the details in Course 2 of this series (if you haven’t already).
It seems that the main advantage for you is that the store can serve the whole team across the organization, keeping track of all datasets, models and deployment versions.