Course 2, Week 1 Lab: C2_W1_Lab_1_TFDV_Exercise

Hi,

To resolve the anomalies, the lab changes the threshold value. I'm a bit confused about the threshold of 0.9: what does it mean exactly? Does it allow 90% of the values to be ones that aren't present in the train stats, or only 10%?

# Relax the minimum fraction of values that must come from the domain for the feature native-country
country_feature = tfdv.get_feature(schema, 'native-country')
country_feature.distribution_constraints.min_domain_mass = 0.9

Thanks

Hi @optimizing_wieghts,

min_domain_mass = 0.9 means that at least 90% of the values in eval_stats have to be in the domain (equivalently, at most 10% of the values may be outside it).

The description of the anomaly detection in the TF documentation is a bit awkward, but it basically states that the share of values not in the domain must be greater than (1 - min_domain_mass) for the anomaly to occur.

You can verify this by setting
country_feature.distribution_constraints.min_domain_mass = 0.9999
(allowing at most 1 in 10,000 values to be outside the domain); you will see that the anomaly still shows up.
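
To make the rule concrete, here is a minimal plain-Python sketch of the check as I described it above. It is only an illustration, not TFDV's actual implementation; the function name, the country list, and the made-up value "Atlantis" are all hypothetical.

```python
def violates_min_domain_mass(values, domain, min_domain_mass):
    """Return True when too few values fall inside the domain.

    Illustration only: a hand-written stand-in for the rule discussed
    in this thread, not code taken from TFDV.
    """
    if not values:
        return False  # nothing to check
    in_domain = sum(1 for v in values if v in domain)
    # Anomaly when the in-domain share drops below min_domain_mass,
    # i.e. the out-of-domain share exceeds 1 - min_domain_mass.
    return in_domain / len(values) < min_domain_mass


# Hypothetical data: 10,000 values, 2 of them never seen in training.
domain = {"United-States", "Mexico", "Canada"}
values = ["United-States"] * 9998 + ["Atlantis"] * 2

print(violates_min_domain_mass(values, domain, 0.9))     # False
print(violates_min_domain_mass(values, domain, 0.9999))  # True
```

With 2 in 10,000 values outside the domain, the relaxed threshold of 0.9 passes, while the strict 0.9999 still flags the feature.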

Hopefully that clarifies…

Regards,
Maarten


@mjsmid, thanks very much for your reply!

Hi @mjsmid ,

I'm trying to understand min_domain_mass = 0.9 by working through an example. Please check whether I've got it right:

  • Say country_feature has 100 values, with 5 categories in the training set.
  • With min_domain_mass = 0.9, the anomaly will not be detected unless the number of out-of-domain values is greater than 10.

Is that right?
Thanks~


Hi @Damon,

Thanks for your question.
Your understanding is right: the anomaly will only be detected/reported when more than 10 of the 100 native-country values fall outside the 5 categories found in the training set.
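
The arithmetic for this example can be sketched in a few lines of Python. This is just my own illustration of the boundary, assuming the rule "anomaly when the out-of-domain share exceeds 1 - min_domain_mass"; the helper name is hypothetical and nothing here calls TFDV.

```python
from fractions import Fraction


def max_out_of_domain(total, min_domain_mass):
    """Largest out-of-domain count that does not trigger the anomaly."""
    # Exact fractions avoid float surprises such as 1 - 0.9 != 0.1.
    allowed_share = 1 - Fraction(str(min_domain_mass))
    return int(allowed_share * total)  # rounds down


limit = max_out_of_domain(100, 0.9)
print(limit)  # 10

for outside in (9, 10, 11):
    status = "anomaly" if outside > limit else "ok"
    print(outside, status)
```

So with 100 values, 10 out-of-domain values are still accepted and the 11th is the one that triggers the anomaly.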

Good luck and enjoy the rest of the course,

Maarten
