Hi DeepLearning community,

I just started the graded Programming Assignment for Course 2, Week 1 (Coursera link: Data Validation). While reading the provided code, I noticed a possible bug that doesn’t affect the Assignment results, but wanted to ask here in case I’m missing something obvious.

Code cell #7 defines the function `prepare_data_splits_from_dataframe(df)`. This function is intended to randomly split the input DataFrame `df` into DataFrames for training (70% of the input rows), evaluation (15%), and serving (15%). However, the function *first* selects a contiguous subset of rows using `iloc`, *then* calls `.sample(frac=1, random_state=48)` on that subset – which simply shuffles the rows of the subset.

Each subset ends up with the expected number of rows, and the results are reproducible; for example, `train_df` always ends up with 71,236 rows. However, `train_df` will always be a shuffled version of the first contiguous 70% of `df`, regardless of the value of `random_state`. Similarly, `eval_df` and `serving_df` end up being shuffled versions of the same contiguous 15% blocks of `df`.

To *randomly* sample from the input DataFrame, should these lines:

```python
train_df = df.iloc[:train_len].sample(frac=1, random_state=48).reset_index(drop=True)
eval_df = df.iloc[train_len: train_len + eval_len].sample(frac=1, random_state=48).reset_index(drop=True)
serving_df = df.iloc[train_len + eval_len: train_len + eval_len + serv_len].sample(frac=1, random_state=48).reset_index(drop=True)
```

be changed to something like the following?

```python
# Shuffle rows of the full input DataFrame
shuffled_df = df.sample(frac=1, random_state=48)
# Sample from the shuffled rows of the input DataFrame
train_df = shuffled_df.iloc[:train_len].reset_index(drop=True)
eval_df = shuffled_df.iloc[train_len: train_len + eval_len].reset_index(drop=True)
serving_df = shuffled_df.iloc[train_len + eval_len: train_len + eval_len + serv_len].reset_index(drop=True)
```
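To make the difference concrete, here is a minimal toy reproduction (using a hypothetical 100-row DataFrame rather than the assignment's actual data): with the current slice-then-shuffle code, the training split always contains exactly the first 70% of the rows, only reordered, whereas shuffling first and then slicing draws a genuinely random 70%.

```python
import pandas as pd

# Toy stand-in for the assignment's DataFrame (hypothetical; the real df
# has ~100k rows)
df = pd.DataFrame({"x": range(100)})
train_len = 70

# Current behavior: slice first, then shuffle. The result is always the
# first 70 rows of df, just in a different order.
buggy_train = df.iloc[:train_len].sample(frac=1, random_state=48)
print(sorted(buggy_train["x"]) == list(range(train_len)))  # True for any random_state

# Proposed behavior: shuffle first, then slice. The result is a random 70%
# of df's rows, so it almost surely includes rows from beyond the first 70.
shuffled = df.sample(frac=1, random_state=48)
fixed_train = shuffled.iloc[:train_len].reset_index(drop=True)
print(set(fixed_train["x"]) == set(range(train_len)))
```

The first check prints `True` no matter what seed is used, which is exactly the symptom described above; the second is `False` for essentially any seed, since the shuffled split draws from the whole DataFrame.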