C2W1_Assignment Exercise 5: Detecting Anomalies. No anomalies found?!

No anomalies are detected when validating the evaluation statistics in Exercise 5. Weirdly, the autograder also gives zero points for "Exercise 6: Fix evaluation anomalies in the schema". I checked my implementation against a friend's, who did the assignment before, and there was no difference. I think there may be a problem with the evaluation data.

Hi @husam

Can you please provide your code block for:

def calculate_and_display_anomalies(statistics, schema):

This is the block just before the one you copied from Exercise 5.

Can you also please provide your definition of eval_stats under Exercise 4?

Thanks
Chris

Hi @Chris, thanks for your response.
Here is the block for calculate_and_display_anomalies:

def calculate_and_display_anomalies(statistics, schema):
    '''
    Calculate and display anomalies.

            Parameters:
                    statistics : Data statistics in statistics_pb2.DatasetFeatureStatisticsList format
                    schema : Data schema in schema_pb2.Schema format

            Returns:
                    display of calculated anomalies
    '''
    ### START CODE HERE
    # HINTS: Pass the statistics and schema parameters into the validation function 
    anomalies = tfdv.validate_statistics(statistics=statistics, schema=schema)
    
    # HINTS: Display input anomalies by using the calculated anomalies
    tfdv.display_anomalies(anomalies)
    ### END CODE HERE
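
For reference, Exercise 5 then just calls this helper on the evaluation statistics. A minimal sketch of that call (assuming eval_stats and schema are the objects produced in the earlier exercises):

# Validate the evaluation statistics against the schema inferred from the training data
calculate_and_display_anomalies(eval_stats, schema=schema)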

I am not sure what you mean by "definition", but here is the first block in Exercise 4, which contains the code that generates eval_stats:

# Generate evaluation dataset statistics
# HINT: Remember to use the evaluation dataframe and to pass the stats_options (that you defined before) as an argument
eval_stats = tfdv.generate_statistics_from_dataframe(eval_df, stats_options=stats_options)
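
For completeness, stats_options comes from an earlier cell. A hypothetical sketch of such a definition (tfdv.StatsOptions is the real API; the feature_allowlist values here are placeholders, not the assignment's actual columns):

# Only compute statistics for an approved subset of columns (placeholder names)
approved_cols = ['feature_a', 'feature_b']
stats_options = tfdv.StatsOptions(feature_allowlist=approved_cols)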

Husam

Hi @husam

That all looks fine.

Taking a quick glance at the output of:

tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

Do the stats look different?
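
If eyeballing the charts is inconclusive, you can also read the record counts straight off the statistics protos. A quick sketch, relying on the standard DatasetFeatureStatisticsList layout:

# num_examples is the record count stored in each statistics proto
print('train:', train_stats.datasets[0].num_examples)
print('eval:', eval_stats.datasets[0].num_examples)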

Hi @Chris
Yes, the stats do look different. At the very least, the size of the data is obviously different.

Edit: I am pretty sure I got the code right.

Hi @husam

For my training data record count I get 71.2k, not 102k.

My next guess would be that there is an issue in def prepare_data_splits_from_dataframe(df):
Can you send me your code for this?

If you have more records in your training set, that would explain why you are not getting those two anomalies: the anomalous values may have been included in your training data.
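
To spell out the reasoning: the schema's domains are inferred from the training statistics only, so any value that sneaks into the training split is, by construction, not an anomaly. As a recap, the expected flow looks like this (the same tfdv calls the assignment uses):

# Domains in the schema come from the training data only
schema = tfdv.infer_schema(statistics=train_stats)
# Values in eval_df that never appeared in train_df are flagged here
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)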

Hi @Chris

sure, here is the code:

def prepare_data_splits_from_dataframe(df):
    '''
    Splits a Pandas Dataframe into training, evaluation and serving sets.

    Parameters:
            df : pandas dataframe to split

    Returns:
            train_df: Training dataframe (70% of the entire dataset)
            eval_df: Evaluation dataframe (15% of the entire dataset)
            serving_df: Serving dataframe (15% of the entire dataset, label column dropped)
    '''
    
    # 70% of records for generating the training set
    train_len = int(len(df) * 0.7)
    
    # Remaining 30% of records for generating the evaluation and serving sets
    eval_serv_len = len(df) - train_len
    
    # Half of the 30%, which makes up 15% of total records, for generating the evaluation set
    eval_len = eval_serv_len // 2
    
    # Remaining 15% of total records for generating the serving set
    serv_len = eval_serv_len - eval_len 
 
    # Shuffle the train, evaluation and serving splits. A fixed random state makes the shuffle repeatable.
    train_df = df.iloc[:train_len].sample(frac=1, random_state=48).reset_index(drop=True)
    eval_df = df.iloc[train_len: train_len + eval_len].sample(frac=1, random_state=48).reset_index(drop=True)
    serving_df = df.iloc[train_len + eval_len: train_len + eval_len + serv_len].sample(frac=1, random_state=48).reset_index(drop=True)
 
    # Serving data emulates the data that would be submitted for predictions, so it should not have the label column.
    serving_df = serving_df.drop(['readmitted'], axis=1)

    return train_df, eval_df, serving_df
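
A quick way to sanity-check the split proportions after calling it (just a sketch; df is the full dataset loaded at the top of the notebook):

# The three splits should be roughly 70% / 15% / 15% of the full dataset
train_df, eval_df, serving_df = prepare_data_splits_from_dataframe(df)
print(len(train_df), len(eval_df), len(serving_df), len(df))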

Hi @husam

That looks fine.

Based on what you have sent me, my last guess as to where something could have gone wrong is the first line under Exercise 1:

train_stats = tfdv.generate_statistics_from_dataframe(train_df, stats_options)

Please confirm that you didn’t pass df instead of train_df.
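
An easy way to confirm is to compare the record count baked into the statistics against the dataframe you think you passed. A minimal check (again using the num_examples field of the proto):

# Fails if the statistics were generated from the wrong dataframe
assert train_stats.datasets[0].num_examples == len(train_df)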

You are also welcome to send me your assignment ipynb file for me to debug.

Hi @Chris
I cannot believe that I passed df instead of train_df :sweat_smile:. But that was it. Everything looks good now and I passed the assignment.
Thanks a lot.

It’s a pleasure @husam. I’m glad we could get to the bottom of it.


I encountered the same problem with the 'No anomalies found' output, and thus failed Ex 5. Surprisingly, I passed all the other exercises. Why?

The answer is in the call to tfdv.validate_statistics().
It apparently needs to be called with the parameter names explicitly assigned, like this:
tfdv.validate_statistics(statistics=statistics, schema=schema)

Do not assume that
tfdv.validate_statistics(statistics, schema)
is the same; in my case, the call with the parameter names omitted resulted in no anomalies being found.
