Difference Between Vectorized and Non-vectorized Implementation of Cats Classifier

Out of curiosity, I decided to try W2A2 a different way and wrote a non-vectorized implementation of it. As expected, the non-vectorized implementation is far slower (4 hours, compared to 7 seconds for the vectorized implementation). To my great surprise, though, its accuracy is much higher, as you can see in the following picture:

This is despite the fact that, as you can see, the cost of the vectorized version decreases much more steeply and ends at a much smaller value than the cost of the non-vectorized version.
I checked and double-checked my non-vectorized implementation, line-by-line, and it seems good to me!
Please explain why this is happening.

I suspect one of your implementations is mathematically different from the other.

Or there’s a mistake in your data sets (training vs testing being swapped or inconsistent).

Your non-vectorized version gives rather suspicious results.

I agree with Tom: there must be something inconsistent about your non-vectorized implementation. Notice that the cost hardly changes at all. Why don’t you modify both implementations to also print the train and test accuracy at the same points that you print the cost (every 100 iterations), so that you can see if there is anything obvious in the patterns of behavior.

If that doesn’t show anything, then maybe it’s time to take a look at your actual code, but we don’t want to do that in public.

As an example I instrumented my code to print the train accuracy after every 400 iterations so that you can see that the cost and the accuracy move together in a way that makes sense:

layers_dims = [12288, 20, 7, 5, 1]
Cost after iteration 0: 0.7717493284237686
Accuracy: 0.5119617224880383
Cost after iteration 100: 0.6720534400822914
Cost after iteration 200: 0.6482632048575212
Cost after iteration 300: 0.6115068816101356
Cost after iteration 400: 0.5670473268366111
Accuracy: 0.8086124401913874
Cost after iteration 500: 0.5401376634547801
Cost after iteration 600: 0.5279299569455267
Cost after iteration 700: 0.4654773771766851
Cost after iteration 800: 0.369125852495928
Accuracy: 0.9186602870813395
Cost after iteration 900: 0.39174697434805344
Cost after iteration 1000: 0.31518698886006163
Cost after iteration 1100: 0.2726998441789385
Cost after iteration 1200: 0.23741853400268137
Accuracy: 0.9760765550239232
Cost after iteration 1300: 0.19960120532208644
Cost after iteration 1400: 0.18926300388463307
Cost after iteration 1500: 0.16118854665827753
Cost after iteration 1600: 0.14821389662363316
Accuracy: 0.9808612440191385
Cost after iteration 1700: 0.13777487812972944
Cost after iteration 1800: 0.1297401754919012
Cost after iteration 1900: 0.12122535068005211
Cost after iteration 2000: 0.11382060668633713
Accuracy: 0.9808612440191385
Cost after iteration 2100: 0.10783928526254133
Cost after iteration 2200: 0.10285466069352679
Cost after iteration 2300: 0.10089745445261786
Cost after iteration 2400: 0.09287821526472398
Accuracy: 0.9856459330143539
Cost after iteration 2499: 0.08843994344170202

It turns out not to be easy to also print the test accuracy without changing the API definition, since only the training data is passed into L_layer_model.
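If you are willing to change the API in your own experimental copy, one way to get both numbers is a small logging helper like the sketch below. This is only an illustration: `accuracy` and `progress_lines` are hypothetical names, not functions from the assignment, and the sketch assumes the predictions are sigmoid outputs of shape (1, m) thresholded at 0.5.

```python
import numpy as np

def accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = (probs > threshold).astype(int)
    return float(np.mean(preds == labels))

def progress_lines(i, cost, train_probs, train_y, test_probs=None, test_y=None):
    """Build the lines to print at iteration i (hypothetical helper).

    The test arguments are optional, so a training loop that only has
    the training data can still call this unchanged.
    """
    lines = [f"Cost after iteration {i}: {cost}"]
    lines.append(f"Train accuracy: {accuracy(train_probs, train_y)}")
    if test_probs is not None and test_y is not None:
        lines.append(f"Test accuracy: {accuracy(test_probs, test_y)}")
    return lines

# Tiny made-up example: 4 training samples, 2 test samples.
train_probs = np.array([[0.9, 0.2, 0.6, 0.4]])
train_y = np.array([[1, 0, 0, 0]])
test_probs = np.array([[0.7, 0.3]])
test_y = np.array([[1, 1]])

for line in progress_lines(0, 0.69, train_probs, train_y, test_probs, test_y):
    print(line)
```

Inside the training loop you would compute the probabilities with a forward pass on each data set at the same iterations where the cost is printed.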

A couple of observations about the above:

  1. Note that there is no improvement in training accuracy between 1600 and 2000, even though the cost continues to go down. That’s not unreasonable, since accuracy is quantized. If a sample is “true” and the prediction goes from 0.65 to 0.75, the accuracy doesn’t improve even if the cost is lower.
  2. Maybe this is a demonstration that early stopping would save us a lot of compute. E.g. we don’t really gain much from the last 800 iterations, but maybe before we make that leap we need to consider the test accuracy numbers as well.
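To make the first observation concrete, here is a small sketch using the numbers from that example: one positive sample whose prediction moves from 0.65 to 0.75. It assumes the usual binary cross-entropy cost and a 0.5 decision threshold; the helper name `cross_entropy` is just for illustration.

```python
import numpy as np

def cross_entropy(y, p):
    """Mean binary cross-entropy for labels y and predicted probabilities p."""
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1.0])        # the sample is "true"
before = np.array([0.65])  # prediction at an earlier iteration
after = np.array([0.75])   # prediction at a later iteration

cost_before = cross_entropy(y, before)  # about 0.431
cost_after = cross_entropy(y, after)    # about 0.288

# Both predictions are above 0.5, so both count as correct.
acc_before = float(np.mean((before > 0.5) == (y == 1)))
acc_after = float(np.mean((after > 0.5) == (y == 1)))

print(f"cost: {cost_before:.3f} -> {cost_after:.3f}")
print(f"accuracy: {acc_before} -> {acc_after}")
```

The cost drops by roughly a third while the accuracy stays at 1.0, which is the same pattern as the plateau between iterations 1600 and 2000 in the log.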

It would be interesting to see something like the above for your two implementations.

Hello, @sushantnair, I think Paul’s suggestions are great! I also always print my train/test losses and accuracies during training.

There is also one good thing in the numbers you shared: the initial costs are the same, so the problem is likely to be in how your weights’ values evolve, which may have to do with the gradients.

So, one more critical thing you can print is the gradient values for the first (e.g.) five iterations. If you have many layers and weights, then maybe 5-10 weight values per layer. Certainly, if the last layer’s gradients are off, then the gradients of all previous layers will be off too, so the check should start from the last layer.

For example, in the initial iteration, if the weights’ values and the training samples are the same for both the vectorized and non-vectorized versions (which is very likely, given the same initial losses), then their gradients should be the same. The question is, are they? Once you find out which gradient is not the same, you know where to start debugging :wink:
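Here is a minimal sketch of that kind of check. It is not the assignment’s code: it uses a single logistic-regression unit as a stand-in for the last layer, and the function names `grads_vectorized` and `grads_loop` are made up for illustration. The idea is to feed identical weights and data to both versions and compare the gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads_vectorized(w, b, X, Y):
    """Gradients of the cross-entropy cost, computed with matrix operations."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)   # activations, shape (1, m)
    dZ = A - Y
    dw = (X @ dZ.T) / m        # shape (n_x, 1)
    db = float(np.sum(dZ)) / m
    return dw, db

def grads_loop(w, b, X, Y):
    """The same gradients, computed with explicit loops over samples and features."""
    n_x, m = X.shape
    dw = np.zeros((n_x, 1))
    db = 0.0
    for i in range(m):
        z = b
        for j in range(n_x):
            z += w[j, 0] * X[j, i]
        dz = sigmoid(z) - Y[0, i]
        for j in range(n_x):
            dw[j, 0] += X[j, i] * dz
        db += dz
    return dw / m, db / m

# Identical, fixed inputs for both versions (made-up data).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))
Y = (rng.random((1, 6)) > 0.5).astype(float)
w = rng.standard_normal((4, 1)) * 0.01
b = 0.0

dw_v, db_v = grads_vectorized(w, b, X, Y)
dw_l, db_l = grads_loop(w, b, X, Y)

print("max |dw difference|:", float(np.max(np.abs(dw_v - dw_l))))
print("|db difference|:", abs(db_v - db_l))
```

If the two versions are mathematically the same, the differences should be at the level of floating-point round-off. In a multi-layer network you would do the same comparison per layer, starting from the last one.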



Thanks everyone for your suggestions. But I’d like to clarify some things.
As a reference, my vectorized version is accurate, as it passed the autograder with a score of 100.
I implemented the non-vectorized version by carefully following the vectorized version. Also, at each step, checks were performed, just as in the vectorized version, to ensure that the functions work correctly.
Also, to be honest, I am not really up for running that thing for another 4 hours :slight_smile: So it’d be very helpful if I could share my non-vectorized code with someone. Please tell me how to do that in a way that my code is visible only to the intended person and not to the public.
Thanks again!

Sure, I will send you a DM about how to share the code privately.

But why is that hard? You only have to click once to start the job. For the 4 hours, you can go take a walk and get lunch and run a marathon in the meantime.


Actually it also slows down my laptop and I also need it for some other computations!
Also, in case there is indeed something wrong, I think it best that some expert have a look at it, rather than me merely running wrong code again.
I still believe that I have done it correctly, as the sanity checks in between, for each implemented function, do seem OK.
Anyways please DM me.

The point is that debugging is part of the job of programming. You’re not done until it works. If reading and understanding the code is not enough to “get you there”, then the next step is running it with instrumentation added, so that you can investigate why it’s wrong. In debug mode, you don’t have to run it the full 4 hours. The hope would be that something would become apparent relatively quickly if you add useful instrumentation of the sort that Raymond was describing.

You are just putting the work on someone else. That’s a fine strategy if you can find someone who is willing to do that work for free. Maybe you get lucky in that case. Stay tuned. :nerd_face:


BTW I did send you a DM about how to share code last night, but I don’t see anything there yet … In Discourse, DMs show up in your queue just like replies on public threads, but you can distinguish them by the “envelope” icon.

I agree with Paul. You definitely do not have to run it for 4 hours. I believe we can start with just the first 5 iterations, with those additional prints.



Thanks Paul! Actually I’m new here, so it’ll take some time till I get the hang of the features.