Hey community!
For anyone using Scikit-Learn (rather than TensorFlow), I thought I'd demonstrate how some key principles from this module can be reproduced in a simple way with Scikit-Learn.
With scikit-learn Pipelines you can define a sequence of steps/transformations that is applied not only to the training dataset, but also to the testing and serving data. Scikit-Learn Pipelines also follow the same standard of storing constants (quantiles, vocabularies, scaling statistics, and so on) that are used downstream to apply exactly the same transformation steps to other datasets. Moreover, the fit operation is performed only on the training dataset, which avoids data leakage into downstream processes.
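As a quick illustration of that last point, here is a minimal sketch (with made-up data and column names) showing that the learned statistics live inside the fitted pipeline and are computed from the training split only:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; in practice these would be your real train/test splits.
train = pd.DataFrame({"age": [25, 32, 47], "income": [40_000, 52_000, 90_000]})
test = pd.DataFrame({"age": [29, 61], "income": [48_000, 75_000]})

prep = Pipeline([("scaler", StandardScaler())])
prep.fit(train)  # statistics are computed from the training data only

# The learned constants are stored on the fitted step and reused downstream.
print(prep.named_steps["scaler"].mean_)  # means learned from `train`
print(prep.transform(test))              # test data scaled with the *training* statistics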
Assuming that the F-test is our choice from previous experiments, and also that scaling is an inherent part of our pipeline, we can build it as follows:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# df, names, Y, results and calculate_metrics are assumed to be defined
# earlier in the notebook.
X = df[names]
# Split train and test set
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=123
)
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        # Use SelectKBest to select the top 20 features based on the F-test
        ("feature-selection", SelectKBest(f_classif, k=20)),
        # Scoring algorithm always goes as the final step of the pipeline
        ("rf", RandomForestClassifier(criterion="entropy", random_state=47)),
    ]
)
pipe.fit(X_train, Y_train)
# pipe.score(X_test, Y_test)
# The pipeline object is treated by Scikit-Learn as a single model.
# The test set doesn't need to be scaled upfront, because that
# transformation is now part of the model (the pipeline).
acc, roc, prec, rec, f1 = calculate_metrics(
    model=pipe, X_test_scaled=X_test, Y_test=Y_test
)
# Construct a dataframe to display metrics.
pipeline_eval_df = pd.DataFrame(
    [[acc, roc, prec, rec, f1, 20]],
    columns=["Accuracy", "ROC", "Precision", "Recall", "F1 Score", "Feature Count"],
)
pipeline_eval_df.index = ["Pipeline"]
# Append to results and display
# DataFrame.append is deprecated/removed in recent pandas; concat does the same job.
results = pd.concat([results, pipeline_eval_df])
results.head(10)
When added to your personal notebook, the snippet above will append a new row with the same metrics as the F-test experiment.
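To make the training-vs-serving consistency point concrete: because the scaler and feature selector are baked into the pipeline, you can persist the whole object and apply identical transformations at serving time. A short sketch (the file name is just illustrative):
import joblib

joblib.dump(pipe, "pipeline.joblib")           # persist transformations + model together
serving_pipe = joblib.load("pipeline.joblib")
predictions = serving_pipe.predict(X_test)     # raw features in, same preprocessing applied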
Other ML libraries, such as Spark ML, also offer similar “pipeline” functionality for the same purposes (reproducibility, training vs serving consistency, etc.). So far, my understanding is that one of TensorFlow's advantages is the metadata store and how it ensures strong traceability/lineage for real-world use cases (that potentially deal with auditability, etc.). Would love to hear more thoughts around this point as well.
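For comparison, a rough Spark ML equivalent would look something like the sketch below (column names and the train_df/test_df DataFrames are illustrative, not from the course material):
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StandardScaler, VectorAssembler

assembler = VectorAssembler(inputCols=["age", "income"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features", withMean=True)
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

spark_pipe = Pipeline(stages=[assembler, scaler, rf])
model = spark_pipe.fit(train_df)   # fit on the training DataFrame only
scored = model.transform(test_df)  # same stages applied to test/serving data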
Cheers,
Daniel