No anomalies are detected when validating the evaluation statistics in Exercise 5, and, strangely, the autograder has given zero points for “Exercise 6: Fix evaluation anomalies in the schema”. I checked my implementation against a friend who did the assignment before, and there was no difference. I think there may be a problem with the evaluation data.
Hi @husam
Can you please provide your code block for:
def calculate_and_display_anomalies(statistics, schema):
This is the block just before the one you copied from Exercise 5.
Can you also please provide your definition of eval_stats under Exercise 4?
Thanks
Chris
Hi @Chris, thanks for your response.
Here is the block for calculate_and_display_anomalies:
def calculate_and_display_anomalies(statistics, schema):
    '''
    Calculate and display anomalies.
    Parameters:
        statistics : Data statistics in statistics_pb2.DatasetFeatureStatisticsList format
        schema : Data schema in schema_pb2.Schema format
    Returns:
        display of calculated anomalies
    '''
    ### START CODE HERE
    # HINTS: Pass the statistics and schema parameters into the validation function
    anomalies = tfdv.validate_statistics(statistics=statistics, schema=schema)
    # HINTS: Display input anomalies by using the calculated anomalies
    tfdv.display_anomalies(anomalies)
    ### END CODE HERE
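For reference, here is a minimal sketch of how this function gets exercised later in the notebook (assuming, as in the earlier exercises, that tfdv is tensorflow_data_validation and that schema was inferred from train_stats):

import tensorflow_data_validation as tfdv

# Infer the schema from the training statistics, then validate the
# evaluation statistics against it and render any anomalies found.
schema = tfdv.infer_schema(statistics=train_stats)
calculate_and_display_anomalies(eval_stats, schema=schema)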
I am not sure what you mean by definition, but here is the first block in Exercise 4, which contains the code for generating eval_stats:
# Generate evaluation dataset statistics
# HINT: Remember to use the evaluation dataframe and to pass the stats_options (that you defined before) as an argument
eval_stats = tfdv.generate_statistics_from_dataframe(eval_df, stats_options=stats_options)
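(stats_options itself comes from an earlier cell; a minimal hypothetical definition, not necessarily the assignment's exact one, would be:)

# Hypothetical sketch: constrain stats generation with the previously inferred schema.
stats_options = tfdv.StatsOptions(schema=schema)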
Husam
Hi @husam
That all looks fine.
Taking a quick look at the output of:
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')
Do the stats look different?
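One quick way to compare the record counts programmatically (a sketch, assuming the variable names above):

# num_examples is recorded on each DatasetFeatureStatistics proto in the list.
print('train:', train_stats.datasets[0].num_examples)
print('eval:', eval_stats.datasets[0].num_examples)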
Hi @Chris
Yes, the stats do look different. At the very least, the size of the data is obviously different.
Edit: I am pretty sure I got the code right.
Hi @husam
For my training data record count I get 71.2k, not 102k.
My next guess would be that there is an issue in def prepare_data_splits_from_dataframe(df):
Can you send me your code for this?
If you have more records in your training set, that would explain why you are not getting those two anomalies: the anomalous records may have ended up in your training data, in which case the schema inferred from it would already treat them as valid.
Hi @Chris
sure, here is the code:
def prepare_data_splits_from_dataframe(df):
    '''
    Splits a Pandas Dataframe into training, evaluation and serving sets.
    Parameters:
        df : pandas dataframe to split
    Returns:
        train_df: Training dataframe (70% of the entire dataset)
        eval_df: Evaluation dataframe (15% of the entire dataset)
        serving_df: Serving dataframe (15% of the entire dataset, label column dropped)
    '''
    # 70% of records for generating the training set
    train_len = int(len(df) * 0.7)
    # Remaining 30% of records for generating the evaluation and serving sets
    eval_serv_len = len(df) - train_len
    # Half of the 30%, which makes up 15% of total records, for generating the evaluation set
    eval_len = eval_serv_len // 2
    # Remaining 15% of total records for generating the serving set
    serv_len = eval_serv_len - eval_len

    # Sample the train, validation and serving sets. We specify a random state for repeatable outcomes.
    train_df = df.iloc[:train_len].sample(frac=1, random_state=48).reset_index(drop=True)
    eval_df = df.iloc[train_len: train_len + eval_len].sample(frac=1, random_state=48).reset_index(drop=True)
    serving_df = df.iloc[train_len + eval_len: train_len + eval_len + serv_len].sample(frac=1, random_state=48).reset_index(drop=True)

    # Serving data emulates the data that would be submitted for predictions, so it should not have the label column.
    serving_df = serving_df.drop(['readmitted'], axis=1)

    return train_df, eval_df, serving_df
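(As a sanity check on the split sizes, a sketch assuming df is the full dataframe loaded earlier:)

train_df, eval_df, serving_df = prepare_data_splits_from_dataframe(df)
# The three splits should partition the dataset into roughly 70/15/15.
assert len(train_df) + len(eval_df) + len(serving_df) == len(df)
print(len(train_df), len(eval_df), len(serving_df))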
Hi @husam
That looks fine.
Based on what you have sent me, my last guess as to where something could have gone wrong is the first line under Exercise 1:
train_stats = tfdv.generate_statistics_from_dataframe(train_df, stats_options)
Please confirm that you didn’t pass df instead of train_df.
You are also welcome to send me your assignment ipynb file for me to debug.
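A simple guard against exactly this mix-up (a sketch, assuming the variable names above):

train_stats = tfdv.generate_statistics_from_dataframe(train_df, stats_options=stats_options)
# If df had been passed by mistake, the example count would match the full dataset.
assert train_stats.datasets[0].num_examples == len(train_df)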
Hi @Chris
I cannot believe that I passed df instead of train_df. But that was it. Now everything looks good and I passed the assignment.
Thanks a lot.
It’s a pleasure @husam. I’m glad we could get to the bottom of it.
I encountered the same problem with the ‘No anomalies found’ output, and thus failed Exercise 5. Surprisingly, I passed all the other exercises. Why?
The answer is in the call to tfdv.validate_statistics().
It apparently needs to be called with the parameter names explicitly assigned, like this:
tfdv.validate_statistics(statistics=statistics, schema=schema)
Do not assume that tfdv.validate_statistics(statistics, schema) is the same; in my case, the call with the parameter names omitted resulted in no anomalies being found. Spelling out the keyword arguments also guards against accidentally swapping the two parameters.