TFDV infer_schema "Presence", "Valency" intuition


I wanted to understand on what basis the Presence and Valency variables are decided.

Hello Huzefa, welcome to discourse!

In order to get a better intuition about the concepts of ‘Presence’ and ‘Valency’ in TFX, chapter 3 of this TFX paper is a good start.

A deeper dive into what triggers an anomaly for valency can be found here:
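To build some intuition before the links: roughly speaking, presence is about how often a feature appears in the examples at all, and valency is about how many values it carries when it does appear. The helper below is a made-up sketch of that idea in plain Python; the real `tfdv.infer_schema` works on pre-computed statistics, not raw dicts, but the intuition is the same.

```python
def infer_presence_valency(examples, feature):
    """Return (presence_fraction, min_valency, max_valency) for one feature.

    presence: fraction of examples in which the feature appears at all.
    valency:  how many values the feature carries when it is present.
    """
    present = [ex[feature] for ex in examples if feature in ex]
    presence_fraction = len(present) / len(examples)
    counts = [len(values) for values in present]
    return presence_fraction, min(counts), max(counts)

examples = [
    {"tags": ["a", "b"], "age": [42]},
    {"tags": ["c"], "age": [7]},
    {"age": [19]},                      # "tags" is missing in this example
]

print(infer_presence_valency(examples, "age"))   # (1.0, 1, 1): always present, single-valued
print(infer_presence_valency(examples, "tags"))  # present in 2/3 of examples, 1 to 2 values
```

From such summaries a schema would mark `age` as required and single-valued, while `tags` would be optional and multi-valued.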

Presence has to do with data distributions, and anomalies are detected if the distribution has drifted beyond a certain threshold, for instance using L-infinity distance or Jensen-Shannon divergence. See the following two links for more details on their use in TFDV:

See “COMPARATOR_L_INFTY_HIGH” and “COMPARATOR_JENSEN_SHANNON_DIVERGENCE_HIGH” for the conditions under which an anomaly is detected:
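To make the threshold idea concrete, here is a small pure-Python sketch of the two distance measures. The histograms and the threshold value are made up for illustration; in TFDV the distributions come from feature statistics and the thresholds are configured on the schema's drift/skew comparators.

```python
import math

def l_infinity(p, q):
    # Largest absolute per-bucket difference between the two distributions.
    return max(abs(pi - qi) for pi, qi in zip(p, q))

def js_divergence(p, q):
    # Jensen-Shannon divergence: average base-2 KL against the midpoint.
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

baseline = [0.6, 0.3, 0.1]   # category frequencies in the training data
serving  = [0.1, 0.3, 0.6]   # frequencies observed at serving time

# Analogous to the COMPARATOR_*_HIGH checks: an anomaly fires when the
# distance exceeds the configured threshold (0.1 is an arbitrary choice here).
threshold = 0.1
print(l_infinity(baseline, serving) > threshold)    # True -> drift anomaly
print(js_divergence(baseline, serving) > threshold) # True -> drift anomaly
```

Both checks fire here because the most and least common categories have swapped places between training and serving.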

I hope this info helps clarify things.
Happy Learning!


Hello Maarten,

One question: why are StatisticsGen and SchemaGen executed before the transformation step in TFX pipelines? If the transformation code is modified after production deployment, or even during model development, the validation checks won't catch the data drift.


Hi @robertoz ,
Welcome to Discourse, and thanks for your question. I'll try to answer it below:

We want to detect drift or other anomalies in the raw, untransformed data, in order to detect whether the world has changed over time (which could decrease model performance, so we want to be alerted early by the ExampleValidator).
If the transformation code is changed, we would normally know about it, since we usually do not change the transformation between training and serving. Therefore, doing the drift detection after transformation would not give us additional useful information, because we are already aware of the transformation update.
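There is a second reason to validate the raw data: some transformations can hide drift entirely. Here is a small, hypothetical sketch (plain Python, not TFX code) in which a per-batch standardization transform makes a large mean shift in the serving data invisible after transformation.

```python
import statistics

def standardize(xs):
    # z-score transform, fit on the same batch it is applied to
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

train_raw   = [10.0, 11.0, 12.0, 13.0, 14.0]
serving_raw = [110.0, 111.0, 112.0, 113.0, 114.0]  # world changed: +100 shift

raw_drift = abs(statistics.mean(train_raw) - statistics.mean(serving_raw))
post_drift = abs(statistics.mean(standardize(train_raw))
                 - statistics.mean(standardize(serving_raw)))

print(raw_drift)   # 100.0 -> caught when validating raw statistics
print(post_drift)  # 0.0   -> invisible after the transform
```

So even if drift detection were run after the transform, a normalizing transformation could mask exactly the kind of change we want to be alerted about.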
Hopefully this clarifies why we compute the stats and schema before transformation.

Best regards

Hello Maarten,

Yes, that clears up my doubt, thank you for your answer. If the transformation is packaged together with the model, it should be more straightforward. On the other hand, when the transformation is not packaged with the model (e.g. for performance reasons), it could be a check in Continuous Integration before deployment.