How to specify schema environment in ExampleValidator?

When we use tfdv.validate_statistics, there is a parameter to specify schema environment. However, there seems to be no such parameter for ExampleValidator. How do you specify the schema environment when using ExampleValidator?

Hi @Anilsekhar,

Please take a look at the C2_W2 Ungraded lab for an example:

# Instantiate ExampleValidator with the StatisticsGen and SchemaGen ingested data
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])

The ExampleValidator is a TFX component.

Hope this helps.

Hi @fabioantonini ,

This question actually came up after I went through the C2_W2 Ungraded lab. In the schema_gen instance, we define two environments ('TRAINING' and 'SERVING') and specify that the 'SERVING' schema environment does not include the 'label' feature.

My question is: how does example_validator know which schema environment to use (i.e., 'TRAINING' or 'SERVING') when it validates the statistics from statistics_gen against the schema from schema_gen? The definition of example_validator has no option for specifying which schema environment to use during validation.

In an earlier lab, when we used tfdv.validate_statistics to validate statistics, we could specify the schema environment in the following way:

serving_anomalies_with_env = tfdv.validate_statistics(serving_stats, schema, environment='SERVING')

Hi @Anilsekhar,
Sorry, the question was not clear to me at first.
Do you mean C2_W3_Lab_2_IterativeSchema? In the W2 lab I didn't find anything about the TRAINING and SERVING environments.
Thanks

Hi @fabioantonini ,

You are right. I mistakenly wrote C2_W2 Ungraded lab instead of C2_W3_Lab_2_IterativeSchema.
The following is a snapshot of the relevant code snippet, in which the feature 'label' is removed from the 'SERVING' environment.


The following is the snapshot of the relevant code snippet in which example_validator is run.

The following snapshot shows the list of arguments we can specify for example_validator. There is no option to specify the schema environment.

I am wondering how example_validator knows which schema environment to use when it does the validation.

Hi @Anilsekhar,
Unfortunately, I have not been able to find an answer to your question. For now I can only confirm what you said: there is no way to pass the desired schema environment to ExampleValidator.
However, I realized that the Graded Lab shows a more meaningful example.
As in the Ungraded Lab, exercise 6 of the Graded Lab validates the serving dataset statistics in the SERVING environment using the new (curated) schema:

anomalies = tfdv.validate_statistics(serving_stats, schema=schema, environment='SERVING')

Exercises 7, 8 and 9 then explain the correct usage of ExampleValidator:

  1. Import the curated schema into ML Metadata with the ImporterNode API.
  2. Use StatisticsGen to compute the statistics with the curated schema (don't forget to pass the 'schema' parameter so that StatisticsGen infers the data types from this new schema).
  3. Check for anomalies using ExampleValidator. You will need to pass in the updated statistics and schema from the previous steps.

In the Ungraded Lab, steps 1 and 2 are missing, so ExampleValidator runs with the previous inputs (the older statistics and the old schema). I think this is misleading. I will report it internally so that the misleading example in the Ungraded Lab can be fixed.
