AI4M Course 3 Week 1 UNQ_11


I’m stuck on the Course 3, Week 1 assignment, exercise 11 (# UNQ_C11, def treatment_dataset_split), specifically the part below.

    # From the training set, get the labels of patients who received treatment
    y_treat_train = y_train[X_treat_train.index]

Frankly I’m not 100% sure exactly what the question is asking for.

  1. Is it just to subset the input y_train numpy array to output array y_treat_train WITHOUT adding another column for labels? More specifically, is y_treat_train expected to have 1 column?
  2. If y_treat_train is expected to have an additional column for labels, then what should the labels be? The text cell below uses a column called ID, but in later cells the actual data X_dev doesn’t have an ID column. So it doesn’t seem correct to hardcode ID as the label column of y_treat_train inside the function. This is VERY confusing to me.
  3. I tried using the index as shown in the snippet above, but it failed because the input y_train is an array, which has no index, unlike the input X_train. Normally, as in the rest of the assignment, the inputs X and y would both be DataFrames sharing the same index.
  4. According to the description of def treatment_dataset_split, the output y_treat_train is expected to be an np.array. However, in later cells the output of treatment_dataset_split is passed to holdout_grid_search, which expects y_treat_train to be a one-column DataFrame. This is another inconsistency that confuses me.
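
To illustrate point 3, here is a minimal sketch on made-up data. The column name TRTMT, the index labels, and the values are assumptions chosen for the example, not the assignment's data. It shows why indexing a NumPy array with a DataFrame's index can fail, and one positional alternative (a boolean mask) that does not rely on an index:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: X keeps row labels from a larger frame, y is a plain array.
X_train = pd.DataFrame(
    {"TRTMT": [1, 0, 1, 0], "age": [50, 60, 70, 80]},
    index=[10, 3, 7, 25],
)
y_train = np.array([1, 0, 0, 1])

X_treat_train = X_train[X_train["TRTMT"] == 1]

# NumPy treats the index labels (10 and 7) as positions, which are out of
# bounds for an array of length 4.
try:
    y_train[X_treat_train.index]
    failed = False
except IndexError:
    failed = True

# A boolean mask is purely positional, so it works on an array,
# assuming X_train and y_train list the patients in the same order.
mask = (X_train["TRTMT"] == 1).to_numpy()
y_treat_train = y_train[mask]
```

If the labels happened to fall inside the array's bounds, the first lookup would not raise but could silently pick the wrong rows, which is arguably worse.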

Hopefully this makes some sense. This is quite frustrating, and I’d appreciate any clarification.



I seem to have made it work by adding two lines at the beginning of the function that convert the inputs y_train / y_val to pandas DataFrames with the same index as the corresponding X_train / X_val:

y_train = pd.DataFrame(y_train, index = X_train.index)
y_val = pd.DataFrame(y_val, index = X_val.index)

After this, I can successfully generate y_treat/control_train/val, for example:

 y_treat_train = y_train.loc[X_treat_train.index]

This way, it works for the unit test, which passes an array for y_train/val, and also for the later cell, which passes a DataFrame.
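
The same idea can be sketched as a small standalone helper. The name align_labels, the column names, and the toy data are hypothetical, and the sketch assumes X and y list the same patients in the same order. It says nothing about what the autograder expects as a return type.

```python
import numpy as np
import pandas as pd

def align_labels(X, y):
    """Hypothetical helper: return y as a Series that shares X's index.

    Accepts y as a NumPy array, Series, or one-column DataFrame, and assumes
    its rows are in the same order as the rows of X.
    """
    values = np.asarray(y).ravel()
    if len(values) != len(X):
        raise ValueError("X and y must have the same number of rows")
    return pd.Series(values, index=X.index)

# Made-up data for illustration only.
X_train = pd.DataFrame({"TRTMT": [1, 0, 1]}, index=[12, 5, 40])
y_as_array = np.array([0, 1, 1])
y_as_frame = pd.DataFrame({"outcome": [0, 1, 1]}, index=X_train.index)

X_treat_train = X_train[X_train["TRTMT"] == 1]

# Label-based lookup now works for either input type.
from_array = align_labels(X_train, y_as_array).loc[X_treat_train.index]
from_frame = align_labels(X_train, y_as_frame).loc[X_treat_train.index]
```

Copying the raw values first avoids a pitfall of pd.DataFrame(y, index=...): when y is already a DataFrame with a different index, that constructor reindexes and can introduce NaN rows.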

However, the autograder is still failing.

I’d appreciate it if someone could look into this mysterious bug.



Hi @MrHuanwang,

I have replied to your failing autograder post.