Doubt in Interpretation of code given in section 5. Imputation for C2_W2_ assignment

5. Imputation

dropped_rows = X_train[X_train.isnull().any(axis=1)]

columns_except_Systolic_BP = [col for col in X_train.columns if col not in ['Systolic BP']]

for col in columns_except_Systolic_BP:
    sns.distplot(X_train.loc[:, col], norm_hist=True, kde=False, label='full data')
    sns.distplot(dropped_rows.loc[:, col], norm_hist=True, kde=False, label='without missing data')
    plt.legend()

    plt.show()

I have the following doubts:

  1. The labels say ‘full data’ and ‘without missing data’. Don’t both mean the same thing?

  2. The label says ‘without missing data’, but only the dropped rows (dropped_rows) are plotted. So the orange distribution shows each column for the dropped rows, i.e. it plots the values of the rows for which Systolic BP is missing. Please correct me if I am wrong.

  3. The interpretation part says: “But when considering the age covariate, we see that much more data tends to be missing for patients over 65. The reason could be that blood pressure was measured less frequently for old people to avoid placing additional burden on them.” As I understand it, though, data is missing for patients under 45: in the orange distribution the frequency of those patients is higher, while the frequency of patients over 65 is lower. That should mean most patients over 65 have their Systolic BP recorded. Please correct me if my interpretation is wrong.
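
One way to check point 3 without reading it off two overlaid histograms is to compute, for each age band, the share of rows that have a missing value. Below is a minimal sketch on made-up numbers. The toy frame, the band edges, and the names `X_toy`, `age_group`, and `missing_rate` are assumptions for illustration, not taken from the assignment.

```python
import numpy as np
import pandas as pd

# Made-up toy data (NOT the assignment data), just to show the calculation.
X_toy = pd.DataFrame({
    "Age": [30, 40, 50, 60, 70, 75, 80, 85],
    "Systolic BP": [120, 125, 130, np.nan, np.nan, 140, np.nan, np.nan],
})

# Same mask as in the assignment cell: True for rows with at least one NaN.
has_missing = X_toy.isnull().any(axis=1)

# Hypothetical age bands; the edges 45 and 65 are chosen to match the question.
age_group = pd.cut(
    X_toy["Age"], bins=[0, 45, 65, 120], labels=["<=45", "45-65", ">65"]
)

# The mean of a boolean mask per band is the fraction of rows with missing data.
missing_rate = has_missing.groupby(age_group, observed=False).mean()
print(missing_rate)
```

Running the same three lines on `X_train` instead of `X_toy` would give the missing-data rate per age band directly, which is easier to compare than bar heights in two normalized histograms.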


Hello @AritaHalder

Here, “full data” means the data that still includes the rows with missing values. The assignment you are asking about states:

“We see that our train and validation sets have missing values, but luckily our test set has complete cases.”

It then says:

“As a first pass, we will begin with a complete case analysis, dropping all of the rows with any missing data. Run the following cell to drop these rows from our train and validation sets.”

So the second statement tells you there is a filtered version of the data in which rows with any missing data have been dropped, and that is how the two differ.
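
To see concretely which rows each expression picks out, here is a minimal sketch on a made-up frame. `X_toy` and the variable names are assumptions for illustration, not the assignment data. It compares what `dropna()` keeps in a complete case analysis with what the `isnull().any(axis=1)` mask from the plotting cell selects.

```python
import numpy as np
import pandas as pd

# Made-up toy frame (NOT the assignment data); two rows lack 'Systolic BP'.
X_toy = pd.DataFrame({
    "Age": [30.0, 50.0, 70.0, 80.0],
    "Systolic BP": [120.0, np.nan, 140.0, np.nan],
})

# Complete case analysis: keep only rows with no missing values.
complete_cases = X_toy.dropna()

# Same mask as in the plotting cell: True for rows with at least one NaN.
has_missing = X_toy.isnull().any(axis=1)
rows_with_missing = X_toy[has_missing]

print("kept by dropna():   ", complete_cases.index.tolist())
print("selected by the mask:", rows_with_missing.index.tolist())
```

The two results are complementary: every row lands in exactly one of them, so it is worth checking which of the two the variable being plotted actually holds.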

Incorrect understanding: the orange distribution is the data where rows with missing data have been dropped. I am sharing the table that shows how the columns look after dropping those rows.

So, as you can see, all the columns are still used, but not all the rows.

For this part, I am not clear on what you are basing your interpretation on. It would help me respond better if you could share some screenshots.

Regards
DP
