Hey community!
For anyone using Scikit-Learn (rather than TensorFlow), I thought I'd demonstrate how some key principles from this module can be reproduced in a simple way with Scikit-Learn.
With scikit-learn Pipelines you can define a sequence of steps/transformations that is applied not only to the training dataset, but also to the testing and serving data. Scikit-Learn Pipelines also follow the same standard of storing constants (quantiles, vocabularies, scaling statistics, and so on) that are used downstream to apply exactly the same transformation steps to other datasets. Moreover, the fit operation is performed only on the training dataset, which avoids data leakage into downstream processes.
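As a quick illustration of that last point, here is a minimal sketch (with made-up data and column names) showing that the learned statistics live inside the fitted pipeline and are computed from the training split only:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; in practice these would be your real train/test splits.
train = pd.DataFrame({"age": [25, 32, 47], "income": [40_000, 52_000, 90_000]})
test = pd.DataFrame({"age": [29, 61], "income": [48_000, 75_000]})

prep = Pipeline([("scaler", StandardScaler())])
prep.fit(train)  # statistics are computed from the training data only

# The learned constants are stored on the fitted step and reused downstream.
print(prep.named_steps["scaler"].mean_)  # means learned from `train`
print(prep.transform(test))              # test data scaled with the *training* statistics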
Assuming that the F-test is our choice from previous experiments, and also that scaling is an inherent part of our pipeline, we can build it as follows:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# df, names, Y, results and calculate_metrics are assumed to be defined
# earlier in the notebook.
X = df[names]
# Split train and test set
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=123
)
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        # Use SelectKBest to select the top 20 features based on the F-test
        ("feature-selection", SelectKBest(f_classif, k=20)),
        # Scoring algorithm always goes as the final step of the pipeline
        ("rf", RandomForestClassifier(criterion="entropy", random_state=47)),
    ]
)
pipe.fit(X_train, Y_train)
# pipe.score(X_test, Y_test)
# The pipeline object is treated by Scikit-Learn as a single model.
# The test set doesn't need to be scaled upfront, because that
# transformation is now part of the model (the pipeline).
acc, roc, prec, rec, f1 = calculate_metrics(
    model=pipe, X_test_scaled=X_test, Y_test=Y_test
)
# Construct a dataframe to display metrics.
pipeline_eval_df = pd.DataFrame(
    [[acc, roc, prec, rec, f1, 20]],
    columns=["Accuracy", "ROC", "Precision", "Recall", "F1 Score", "Feature Count"],
)
pipeline_eval_df.index = ["Pipeline"]
# Append to results and display
# DataFrame.append is deprecated/removed in recent pandas; concat does the same job.
results = pd.concat([results, pipeline_eval_df])
results.head(10)
When added to your personal notebook, the snippet above will append a new row with the same metrics as the F-test experiment.
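To make the training-vs-serving consistency point concrete: because the scaler and feature selector are baked into the pipeline, you can persist the whole object and apply identical transformations at serving time. A short sketch (the file name is just illustrative):
import joblib

joblib.dump(pipe, "pipeline.joblib")           # persist transformations + model together
serving_pipe = joblib.load("pipeline.joblib")
predictions = serving_pipe.predict(X_test)     # raw features in, same preprocessing applied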
Other ML libraries, such as Spark ML, also offer similar “pipeline” functionality for the same purposes (reproducibility, training vs serving consistency, etc.). So far, my understanding is that one of TensorFlow's advantages is the metadata store and how it ensures strong traceability/lineage for real-world use cases (that potentially deal with auditability, etc.). Would love to hear more thoughts around this point as well.
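For comparison, a rough Spark ML equivalent would look something like the sketch below (column names and the train_df/test_df DataFrames are illustrative, not from the course material):
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StandardScaler, VectorAssembler

assembler = VectorAssembler(inputCols=["age", "income"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features", withMean=True)
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

spark_pipe = Pipeline(stages=[assembler, scaler, rf])
model = spark_pipe.fit(train_df)   # fit on the training DataFrame only
scored = model.transform(test_df)  # same stages applied to test/serving data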
Cheers,
Daniel