C2W1_Assignment Exercise 5: Detecting Anomalies. No anomalies found?!

No anomalies are detected when validating the evaluation statistics in Exercise 5. Weirdly, the autograder also gives zero points for "Exercise 6: Fix evaluation anomalies in the schema". I checked my implementation against a friend's, who did the assignment before, and there was no difference. I think there may be a problem with the evaluation data.

Hi @husam

Can you please provide your code block for:

def calculate_and_display_anomalies(statistics, schema):

This is the block just before the one you copied from Exercise 5.

Can you also please provide your definition of eval_stats under Exercise 4?

Thanks
Chris

Hi @Chris, thanks for your response.
Here is the block for calculate_and_display_anomalies:

def calculate_and_display_anomalies(statistics, schema):
    '''
    Calculate and display anomalies.

            Parameters:
                    statistics : Data statistics in statistics_pb2.DatasetFeatureStatisticsList format
                    schema : Data schema in schema_pb2.Schema format

            Returns:
                    display of calculated anomalies
    '''
    ### START CODE HERE
    # HINTS: Pass the statistics and schema parameters into the validation function 
    anomalies = tfdv.validate_statistics(statistics=statistics, schema=schema)
    
    # HINTS: Display input anomalies by using the calculated anomalies
    tfdv.display_anomalies(anomalies)
    ### END CODE HERE
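
For reference, Exercise 5 then just calls this helper on the evaluation statistics. A minimal sketch of that call (assuming eval_stats and schema are the objects produced in the earlier exercises):

# Validate the evaluation statistics against the schema inferred from the training data
calculate_and_display_anomalies(eval_stats, schema=schema)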

I am not sure what you mean by "definition", but here is the first block in Exercise 4, which contains the code that generates eval_stats:

# Generate evaluation dataset statistics
# HINT: Remember to use the evaluation dataframe and to pass the stats_options (that you defined before) as an argument
eval_stats = tfdv.generate_statistics_from_dataframe(eval_df, stats_options=stats_options)
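
For completeness, stats_options comes from an earlier cell. A hypothetical sketch of such a definition (tfdv.StatsOptions is the real API; the feature_allowlist values here are placeholders, not the assignment's actual columns):

# Only compute statistics for an approved subset of columns (placeholder names)
approved_cols = ['feature_a', 'feature_b']
stats_options = tfdv.StatsOptions(feature_allowlist=approved_cols)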

Husam

Hi @husam

That all looks fine.

Taking a quick glance at the output of:

tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

Do the stats look different?
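
If eyeballing the charts is inconclusive, you can also read the record counts straight off the statistics protos. A quick sketch, relying on the standard DatasetFeatureStatisticsList layout:

# num_examples is the record count stored in each statistics proto
print('train:', train_stats.datasets[0].num_examples)
print('eval:', eval_stats.datasets[0].num_examples)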

Hi @Chris
Yes, the stats do look different. At the very least, the size of the data is obviously different.

Edit: I am pretty sure I got the code right.

Hi @husam

For my training data record count I get 71.2k, not 102k.

My next guess would be that there is an issue in def prepare_data_splits_from_dataframe(df):
Can you send me your code for this?

If you have more records in your training set, that would explain why you are not getting those two anomalies: the anomalous values may have been included in your training data.
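
To spell out the reasoning: the schema's domains are inferred from the training statistics only, so any value that sneaks into the training split is, by construction, not an anomaly. As a recap, the expected flow looks like this (the same tfdv calls the assignment uses):

# Domains in the schema come from the training data only
schema = tfdv.infer_schema(statistics=train_stats)
# Values in eval_df that never appeared in train_df are flagged here
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)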

Hi @Chris

sure, here is the code:

def prepare_data_splits_from_dataframe(df):
    '''
    Splits a Pandas Dataframe into training, evaluation and serving sets.

    Parameters:
            df : pandas dataframe to split

    Returns:
            train_df: Training dataframe (70% of the entire dataset)
            eval_df: Evaluation dataframe (15% of the entire dataset)
            serving_df: Serving dataframe (15% of the entire dataset, label column dropped)
    '''
    
    # 70% of records for generating the training set
    train_len = int(len(df) * 0.7)
    
    # Remaining 30% of records for generating the evaluation and serving sets
    eval_serv_len = len(df) - train_len
    
    # Half of the 30%, which makes up 15% of total records, for generating the evaluation set
    eval_len = eval_serv_len // 2
    
    # Remaining 15% of total records for generating the serving set
    serv_len = eval_serv_len - eval_len 
 
    # Shuffle the train, evaluation and serving splits. A fixed random state makes the shuffle repeatable.
    train_df = df.iloc[:train_len].sample(frac=1, random_state=48).reset_index(drop=True)
    eval_df = df.iloc[train_len: train_len + eval_len].sample(frac=1, random_state=48).reset_index(drop=True)
    serving_df = df.iloc[train_len + eval_len: train_len + eval_len + serv_len].sample(frac=1, random_state=48).reset_index(drop=True)
 
    # Serving data emulates the data that would be submitted for predictions, so it should not have the label column.
    serving_df = serving_df.drop(['readmitted'], axis=1)

    return train_df, eval_df, serving_df
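
A quick way to sanity-check the split proportions after calling it (just a sketch; df is the full dataset loaded at the top of the notebook):

# The three splits should be roughly 70% / 15% / 15% of the full dataset
train_df, eval_df, serving_df = prepare_data_splits_from_dataframe(df)
print(len(train_df), len(eval_df), len(serving_df), len(df))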

Hi @husam

That looks fine.

Based on what you have sent me, my last guess as to where something could have gone wrong is the first line under Exercise 1:

train_stats = tfdv.generate_statistics_from_dataframe(train_df, stats_options)

Please confirm that you didn’t pass df instead of train_df.
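
An easy way to confirm is to compare the record count baked into the statistics against the dataframe you think you passed. A minimal check (again using the num_examples field of the proto):

# Fails if the statistics were generated from the wrong dataframe
assert train_stats.datasets[0].num_examples == len(train_df)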

You are also welcome to send me your assignment ipynb file for me to debug.

Hi @Chris
I cannot believe that I passed df instead of train_df :sweat_smile:. But that was it. Everything looks good now and I passed the assignment.
Thanks a lot.

It’s a pleasure @husam. I’m glad we could get to the bottom of it.


I encountered the same problem with the 'No anomalies found' output, and thus failed Ex 5. Surprisingly, I passed all the other exercises. Why?

The answer is in the call to tfdv.validate_statistics().
It apparently needs to be called with the parameter names explicitly assigned, like this:
tfdv.validate_statistics(statistics=statistics, schema=schema)

Do not assume that
tfdv.validate_statistics(statistics, schema)
is the same; in my case, the call with the parameter names omitted resulted in no anomalies being found.
