Detecting 0 anomalies for large feature set in 2nd lab C3W1

Since posting code snippets is against the rules, I’ll try to describe the situation as best I can.

In the final section of the anomaly detection lab at the end of C3_W1, there is a larger data set with 11 features that your code gets applied to. My first time through the lab, I didn’t follow the hints for “select_threshold”; instead, I used a different methodology that created arrays of float 1s and 0s using broadcast multiplication, subtraction, and np.ceil (rather than Boolean 1s and 0s using element-wise logic). The function passed the test cell immediately below it, but when I ran the final code cell in the lab (the one that applies select_threshold to the larger data set), it identified 0 anomalies, chose epsilon=0, and calculated F1=0.
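To make the contrast concrete without posting any lab code, here is a generic, made-up illustration of the two styles (reconstructed from memory, so the details are only approximate):

```python
import numpy as np

p_val = np.array([0.01, 0.30, 0.02, 0.50])  # made-up probabilities
epsilon = 0.05

# Hinted style: element-wise Boolean logic gives True/False (i.e., 1/0)
pred_bool = p_val < epsilon

# My style: float 1s and 0s from broadcast subtraction and np.ceil
# (1.0 where p_val < epsilon, 0.0 otherwise, since epsilon - p_val
# stays within (-1, 1) for probability-like values)
pred_float = np.ceil(epsilon - p_val)
```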

The grader compiled the code and gave a passing score, but I couldn’t figure out why the last cell didn’t work with this version of select_threshold. I also tried implementing slightly different versions of the algebraic method using integer arithmetic instead of floats, but that didn’t change the result. Modifying the approach to use np.abs instead of multiplying by -1 caused the final code cell to produce the correct number of anomalies and the correct epsilon, but it calculated F1 to be very small (~0.008). Of course, when I changed the code to use element-wise Boolean tests instead, the final cell produced the expected result.

Any ideas on what’s going on?

Hello @Geoffrey_Blum,

From your description, you were able to produce the expected result. If I were you, I would:

  1. add some print lines to show the outcomes of all intermediate variables (a quick sketch of what I mean is below), and
  2. make a plan in my mind on how I would change the code from the version that produces the expected result to the version that I had questions about.

Then I would go through the plan, making the changes to the code step by step, and see when and where one (or more) of the intermediate variables stops behaving as expected. In this way I can narrow the problem down to, hopefully, the one line that works unexpectedly, and continue my investigation from there.
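For example, a tiny helper like this (the names here are just placeholders, not anything from the lab) often makes dtype problems jump out right away:

```python
import numpy as np

def inspect(name, x):
    """Print the dtype, range, and first few values of an intermediate array."""
    x = np.asarray(x)
    print(f"{name}: dtype={x.dtype}, min={x.min()}, max={x.max()}, head={x[:5]}")

# Inside your function, call it on each intermediate variable, e.g.:
# inspect("predictions", predictions)
# inspect("false positives", fp)
```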

Good luck!

Raymond

Thanks, @rmwkwok! Your suggestion gave me an idea, and in retrospect, it seems kind of obvious.

I’ve checked, and this is the problem: the validation labels (y_val) are passed as an array of unsigned integers, while p_val is an array of floats. By using calculations of the form “(y_val - 1) * -1”, I was triggering some weird uint conversions.
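Here is a minimal reproduction (assuming uint8 labels for illustration; I don’t know the exact unsigned dtype the lab uses, but the wraparound behavior is the same):

```python
import numpy as np

y_uint = np.array([0, 1, 1, 0], dtype=np.uint8)  # labels as unsigned ints
y_int = y_uint.astype(int)                       # labels as signed ints

# With signed ints, the flip works as intended: 0 <-> 1
print((y_int - 1) * -1)   # [1 0 0 1]

# With unsigned ints, the subtraction wraps around instead of going negative
print(y_uint - 1)         # [255   0   0 255], not [-1  0  0 -1]
# Multiplying that wrapped array by -1 then produces garbage counts
# (or an overflow error, depending on the NumPy version).
```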

Notably, I was getting the correct values in the test cell because

(a) the correct epsilon happens to produce 0 false positives, and the false-positive count is the only calculation that uses the uint arithmetic, so the rest of the calculations work correctly, and

(b) the tests to confirm correctness pass y_val as an array of ints (rather than uints).

I would encourage an update to the lab that makes the validation labels used in the lab and those used in the tests match one another (i.e., y_val is either always int or always uint).

Thanks for your analysis and suggestion.

Update: I have submitted a ticket for the course staff to consider.