C2W1 Assignment: minor mistake in `prepare_data_splits_from_dataframe`?

Hi DeepLearning community,

I just started the graded Programming Assignment for Course 2, Week 1 (Coursera link: Data Validation). While reading the provided code, I noticed a possible bug that doesn’t affect the Assignment results, but wanted to ask here in case I’m missing something obvious.

Code cell #7 defines the function prepare_data_splits_from_dataframe(df). This function is intended to randomly split the input DataFrame df into DataFrames for training (70% of input rows), evaluation (15%), and serving (15%). However, the function first selects a contiguous subset of rows using iloc, and only then calls .sample(frac=1, random_state=48) on that subset, which merely shuffles the rows within it.

Each subset ends up with the expected number of rows, and the results are reproducible. For example, train_df always ends up with 71,236 rows. However, the rows of train_df will always be a shuffled version of the first contiguous 70% of df, regardless of the value of random_state. Similarly, eval_df and serving_df end up being shuffled versions of the same contiguous 15% blocks of df.
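Here's a minimal sketch of the behavior on a toy DataFrame (the column name, sizes, and second seed are my own for illustration, not from the notebook):

```python
import pandas as pd

# Toy stand-in for the assignment's DataFrame.
df = pd.DataFrame({"value": range(10)})
train_len = 7  # 70% of 10 rows

# Original approach: slice first, then shuffle within the slice.
train_a = df.iloc[:train_len].sample(frac=1, random_state=48).reset_index(drop=True)
train_b = df.iloc[:train_len].sample(frac=1, random_state=123).reset_index(drop=True)

# Regardless of random_state, both "train" sets contain exactly the
# first 7 rows of df, just in a different order.
print(sorted(train_a["value"]))  # [0, 1, 2, 3, 4, 5, 6]
print(sorted(train_b["value"]))  # [0, 1, 2, 3, 4, 5, 6]
```

So the seed only changes the ordering within each split, never which rows land in which split.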

To randomly sample from the input DataFrame, should these lines:

train_df = df.iloc[:train_len].sample(frac=1, random_state=48).reset_index(drop=True)
eval_df = df.iloc[train_len: train_len + eval_len].sample(frac=1, random_state=48).reset_index(drop=True)
serving_df = df.iloc[train_len + eval_len: train_len + eval_len + serv_len].sample(frac=1, random_state=48).reset_index(drop=True)

be changed to something like the following?

# Shuffle rows of the full input DataFrame
shuffled_df = df.sample(frac=1, random_state=48)

# Sample from the shuffled rows of the input DataFrame
train_df = shuffled_df.iloc[:train_len].reset_index(drop=True)
eval_df = shuffled_df.iloc[train_len: train_len + eval_len].reset_index(drop=True)
serving_df = shuffled_df.iloc[train_len + eval_len: train_len + eval_len + serv_len].reset_index(drop=True)

Thanks for bringing this up.

The second approach you've described is the right one for creating the train, eval, and serving sets, since it doesn't depend on the ordering of the provided dataset.

I’ve asked the staff to clarify their perspective in the notebook.


Thanks for confirming, Balaji! I’ll leave the code unchanged so that I get expected outputs, but it’s good to hear I’m not missing something basic.