TFDV infer_schema "Presence", "Valency" intuition

zef78653 · July 30, 2021, 6:58am

Hello,

I wanted to understand on what basis the Presence and Valency variables are decided.

mjsmid · August 2, 2021, 7:15pm

Hello Huzefa, welcome to discourse!

In order to get a better intuition about the concepts of ‘Presence’ and ‘Valency’ in TFX, chapter 3 of this TFX paper is a good start.
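As a rough intuition for the two terms, here is a minimal pure-Python sketch. It assumes that presence refers to the fraction of examples in which a feature occurs and that valency refers to the number of values a feature holds per example. The function `summarize_feature`, the toy data, and the label "multiple" are made up for illustration; this is not how TFDV's `infer_schema` is implemented.

```python
from typing import Dict, List, Optional

# Toy dataset: each example maps a feature name to a list of values.
# A missing key (or None) means the feature is absent in that example.
Example = Dict[str, Optional[List[object]]]


def summarize_feature(examples: List[Example], name: str) -> Dict[str, object]:
    """Summarize presence and valency of one feature (illustrative only)."""
    # Number of values per example, counting only examples that have the feature.
    counts = [len(ex[name]) for ex in examples if ex.get(name) is not None]
    presence_fraction = len(counts) / len(examples) if examples else 0.0
    is_single = bool(counts) and min(counts) == max(counts) == 1
    return {
        "presence_fraction": presence_fraction,
        # Assumption: "required" means the feature occurs in every example.
        "presence": "required" if presence_fraction == 1.0 else "optional",
        "min_values": min(counts) if counts else 0,
        "max_values": max(counts) if counts else 0,
        # Assumption: "single" means exactly one value in every example.
        "valency": "single" if is_single else "multiple",
    }


examples = [
    {"age": [31], "tags": ["a", "b"]},
    {"age": [45], "tags": ["c"]},
    {"age": [27]},  # "tags" is missing here
]

age_summary = summarize_feature(examples, "age")
tags_summary = summarize_feature(examples, "tags")
print(age_summary)
print(tags_summary)
```

In this toy data, `age` occurs once in every example, while `tags` is missing from one example and holds a varying number of values.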

You can find a deeper dive into what triggers a valency anomaly here:

Presence has to do with data distributions, and anomalies are detected if the distribution has drifted beyond a certain threshold, measured for instance with L-infinity distance or Jensen-Shannon divergence. See the following two links for more details on their use in TFDV:

See “COMPARATOR_L_INFTY_HIGH” and “COMPARATOR_JENSEN_SHANNON_DIVERGENCE_HIGH” for the conditions for anomaly detection:
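To make the two distance measures concrete, here is a small standard-library sketch that computes the L-infinity distance and the Jensen-Shannon divergence between two categorical distributions and compares the result to a threshold. The helper names, the toy data, and the threshold value are assumptions for illustration only; this is not TFDV's implementation.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def normalize(values: Iterable[str]) -> Dict[str, float]:
    """Turn raw categorical values into a probability distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def l_infinity(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Largest absolute difference in probability over all categories."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def jensen_shannon(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits (base-2 logs, so it lies in [0, 1])."""
    keys = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: Dict[str, float]) -> float:
        # KL divergence from `a` to the mixture; zero-probability terms contribute 0.
        return sum(
            a[k] * math.log2(a[k] / mixture[k]) for k in keys if a.get(k, 0.0) > 0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


# Toy data: the share of category "a" grows from 50% to 80%.
train = normalize(["a"] * 50 + ["b"] * 50)
serving = normalize(["a"] * 80 + ["b"] * 20)

THRESHOLD = 0.1  # hypothetical threshold, chosen only for this example

linf = l_infinity(train, serving)
jsd = jensen_shannon(train, serving)
drift_detected = linf > THRESHOLD
print(f"L-infinity: {linf:.3f}, JS divergence: {jsd:.3f}, drift: {drift_detected}")
```

An anomaly would then be raised whenever the chosen distance exceeds the configured threshold for that feature.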

I hope this info helps clarify things.
Happy Learning!
Maarten

robertoz · August 24, 2021, 7:41am

Hello Maarten,

One question: why are StatisticsGen and SchemaGen executed before the transformation step in TFX pipelines? If the transformation code is modified after production deployment, or even during model development, the validation checks won’t catch the data drift.

Regards
Roberto

mjsmid · August 24, 2021, 12:21pm

Hi @robertoz ,
Welcome to discourse and thanks for your question. I’ll try to answer it below.

We want to detect drift and other anomalies in the raw, untransformed data, in order to detect whether the world has changed over time (which could decrease model performance, so we want to be alerted early on by the example validator).
If the transformation code is changed we would normally know about it, as we usually do not change the transformation between training and serving. Therefore, doing the drift detection after transformation would not give us additional useful information, because we are already aware of the transformation update.
Hopefully this clarifies why we compute the stats and schema before transformation.

Best regards
Maarten

robertoz · August 24, 2021, 1:04pm

Hello Maarten,

Yes, it clarifies my doubt, thank you for your answer. When the transformation is packaged together with the model it should be more straightforward. On the other hand, when the transformation is not packaged together with the model (e.g. for performance reasons), this could be covered by a test in Continuous Integration before deployment.
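A minimal sketch of what such a CI check might look like, using only the standard library: it runs the training-time and the serving-time transformation on a few fixed sample rows and fails if the outputs differ. Both transform functions, the feature names, and the sample rows are hypothetical placeholders, not part of TFX.

```python
import math
from typing import Callable, Dict, List

Row = Dict[str, float]
Transform = Callable[[Row], Row]


def training_transform(row: Row) -> Row:
    # Hypothetical preprocessing used at training time.
    return {
        "log_income": math.log1p(row["income"]),
        "age_scaled": row["age"] / 100.0,
    }


def serving_transform(row: Row) -> Row:
    # Hypothetical, separately deployed preprocessing used at serving time.
    return {
        "log_income": math.log1p(row["income"]),
        "age_scaled": row["age"] / 100.0,
    }


def transforms_agree(
    a: Transform, b: Transform, rows: List[Row], tol: float = 1e-9
) -> bool:
    """Return True if both transforms produce matching outputs on all rows."""
    for row in rows:
        out_a, out_b = a(row), b(row)
        if out_a.keys() != out_b.keys():
            return False
        if any(not math.isclose(out_a[k], out_b[k], abs_tol=tol) for k in out_a):
            return False
    return True


# Fixed sample rows that the CI job would keep under version control.
SAMPLE_ROWS = [
    {"income": 0.0, "age": 18.0},
    {"income": 52000.0, "age": 47.0},
]

ci_passed = transforms_agree(training_transform, serving_transform, SAMPLE_ROWS)
print("transform consistency check passed:", ci_passed)
```

The CI job would block deployment whenever this check returns `False`.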

Regards
Roberto
