Relation between Accuracy and Cost in Week 4 Assignment

Hi,

I tried to plot the accuracy and cost against the number of iterations for the cat picture predictions with the multi-layer neural network. I used the code below:

layers_dims = [12288, 20, 7, 5, 1]
variables = [500, 700, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]
trains = []        # train accuracy at each iteration count
tests = []         # test accuracy at each iteration count
trains_costs = []  # train cost at each iteration count
tests_costs = []   # test cost at each iteration count
for v in variables:
    # retrain from scratch for each iteration count
    layers_dims = [12288, 20, 7, 5, 1]
    parameters = L_layer_model(train_x, train_y, layers_dims, num_iterations=v, learning_rate=0.01)
    train_AL, c = L_model_forward(train_x, parameters)
    test_AL, c1 = L_model_forward(test_x, parameters)
    trains_costs.append(compute_cost(train_AL, train_y))
    tests_costs.append(compute_cost(test_AL, test_y))
    trains.append(np.sum(predict(train_x, train_y, parameters) == train_y) / train_y.shape[1])
    tests.append(np.sum(predict(test_x, test_y, parameters) == test_y) / test_y.shape[1])
    pred_test = predict(test_x, test_y, parameters)
plt.plot(variables, trains)
plt.plot(variables, tests)
plt.show()

plt.plot(variables, trains_costs)
plt.plot(variables, tests_costs)
plt.show()

I got the graphs below, with the costs and accuracies plotted against the number of iterations:
[Plot: accuracy vs. number of iterations]
[Plot: cost vs. number of iterations]

As we can see, although the cost for the test data dips at around 750 iterations and then increases all the way up to 4500, the accuracy is 0.8 there, dips at 1000, and then increases further to about 0.84 around 2500 iterations.

Could anyone please help me understand these graphs?

  1. Does it mean that we should use num_iterations = 750 for this model and ignore the higher accuracy, since the cost on the test data is increasing?
  2. Does it mean that after 1000 iterations the model is overfitting?

Thanks in advance

Very interesting! It’s great that you are doing this type of investigation. There’s always something interesting to learn. I agree that it doesn’t seem logical that the test cost would increase in the way that you show. Let’s dig in and see what more we can learn here!

First, there are a few general things to say:

  1. Yes, you’re right that all of this is overfitting. And maybe the bigger problem is that this whole situation is pretty unrealistic, in that the dataset is way, way too small to give a generalizable solution to a problem this complex. Here’s a thread which discusses that point in a bit more detail and shows that the dataset is very carefully curated to give results as good as the ones we see here.

  2. The relationship between cost and accuracy is not as straightforward as you might think at first glance. The high-level point is that accuracy is quantized, but the cost isn’t. What I mean by that is illustrated by the example of a sample with a label of 1 (there’s a small numerical sketch of this right after this list). If the \hat{y} value after 1000 iterations is 0.52, then the answer is already correct. But if after 2000 iterations the \hat{y} value is 0.75, then the cost will be lower, but the accuracy is still the same. Of course it could also go the other direction: going from 0.75 to 0.52 in a later iteration will give you a higher cost with the same accuracy, which is what seems to be happening with the test data in your case.

  3. It’s really only accuracy that we actually care about. The actual J value doesn’t really tell you that much as we see from item 2), but there still is something puzzling in the behavior here that is worth investigating.
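
Here is the small numerical sketch mentioned in item 2. It uses plain NumPy, nothing from the assignment, and just evaluates the per-sample cross-entropy cost for a label of 1 at the two \hat{y} values discussed above:

import numpy as np

y = 1.0  # true label for this sample
for y_hat in (0.52, 0.75):
    cost = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # per-sample cross-entropy
    prediction = 1 if y_hat > 0.5 else 0                       # thresholded at 0.5
    print(f"y_hat = {y_hat:.2f} -> prediction {prediction}, cost {cost:.4f}")

Both \hat{y} values give the same correct prediction, so the contribution to accuracy is identical, but the cost drops from roughly 0.65 to roughly 0.29.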

As far as I can see so far, your code looks completely correct. You could have simplified it a bit by using np.mean to compute the accuracy values. It would also be more efficient to rewrite the code to pass in the iteration numbers where you want the checkpoints and then you’d only have to run the training once, but I totally get why you did it the way you did: my way would be a big rewrite to the core functions, which just messes everything up and introduces more complexity.
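
Just to illustrate the np.mean point (a sketch only, relying on the fact that your predict call already returns a row vector of 0/1 predictions with the same shape as the labels), the two accuracy lines inside your loop could be written as:

# comparing predictions to labels gives booleans; their mean is the accuracy
trains.append(np.mean(predict(train_x, train_y, parameters) == train_y))
tests.append(np.mean(predict(test_x, test_y, parameters) == test_y))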

Ok, none of the above really answers anything yet, but this is just the next step after your interesting steps above. More investigation required. Next I want to dig in a bit and actually look in more detail at the test cost numbers.

I did some experiments and confirmed that the patterns you show are really happening. Then I added some instrumentation to my code to print the following every 100 iterations:

  1. The \hat{y} values for the test set.
  2. The corresponding prediction values.
  3. The actual label values.
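
For reference, here is roughly what that instrumentation looks like. Treat it as a sketch: it assumes a modified L_layer_model that also receives test_x and test_y, and it sits inside the existing print-every-100-iterations block of the training loop, reusing L_model_forward, compute_cost, and np from the assignment. My actual code and print format differ a little.

# inside the training loop of a modified L_layer_model(X, Y, ..., test_x, test_y, ...),
# where i, AL and cost are the loop's iteration counter, activations and training cost
if print_cost and i % 100 == 0:
    test_AL, _ = L_model_forward(test_x, parameters)   # yhat values for the test set
    test_preds = (test_AL > 0.5).astype(int)           # corresponding 0/1 predictions
    test_cost = compute_cost(test_AL, test_y)
    test_acc = np.mean(test_preds == test_y)
    train_acc = np.mean((AL > 0.5).astype(int) == Y)
    print("yhat values:", test_AL)
    print("predictions:", test_preds)
    print("true labels:", test_y)
    print("Accuracy:", test_acc)
    print("Cost after iteration %i: %f test cost %f train acc %f test acc %f"
          % (i, cost, test_cost, train_acc, test_acc))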

Here’s what I see at 1500 iterations:

yhat values: [[0.99713872 0.99286403 0.97269749 0.745034   0.95866618 0.47913776
  0.78870392 0.99615158 0.9522909  0.9887912  0.98751842 0.32393214
  0.97400805 0.99398802 0.17075616 0.99563392 0.29014813 0.9813124
  0.76656457 0.18709067 0.95832288 0.17075616 0.17075616 0.79347531
  0.95944938 0.97518387 0.94346883 0.17075616 0.24489228 0.99400855
  0.95018188 0.99893983 0.98978076 0.93799855 0.80555367 0.17075616
  0.19085674 0.57976951 0.88493086 0.17075616 0.87165991 0.95676703
  0.83044989 0.17038791 0.93011845 0.98798493 0.61161539 0.99982548
  0.63511671 0.27595954]]
predictions: [[1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1 0
  0 1 1 0 1 1 1 0 1 1 1 1 1 0]]
true labels: [[1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 0 0
  0 1 0 0 1 1 1 0 0 0 1 1 1 0]]
Accuracy: 0.82
Cost after iteration 1500: 0.161189 test cost 0.631018 train acc 0.980861 test acc 0.820000

One thing to point out is that the test set is 66% “yes” samples. There is more detailed breakdown and error analysis on that other thread I linked in my previous reply, which is worth a look.
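
As a quick sanity check on that, the label balance can be read straight off the labels (assuming test_y is the usual (1, 50) row vector of 0/1 labels):

print("fraction of 'yes' (cat) samples in the test set:", np.mean(test_y))  # prints 0.66 here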

Then here’s what I see at 2500 iterations:

yhat values: [[0.99927585 0.99723652 0.98033822 0.87269824 0.9898528  0.61828474
  0.80678925 0.99951292 0.98817952 0.99028235 0.99756576 0.48869532
  0.99292423 0.99979253 0.07957959 0.99849999 0.22657872 0.9950299
  0.83585003 0.10305553 0.99081342 0.13144418 0.07957959 0.78723273
  0.95413344 0.99418792 0.96251797 0.07957959 0.19536421 0.99717548
  0.96394148 0.99992524 0.997602   0.98376602 0.87381301 0.07957959
  0.10156225 0.79250949 0.93490598 0.07957959 0.96428135 0.98194365
  0.83668378 0.07934687 0.96596055 0.99856256 0.55463938 0.99999521
  0.59616868 0.14916707]]
predictions: [[1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1 0
  0 1 1 0 1 1 1 0 1 1 1 1 1 0]]
true labels: [[1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 0 0
  0 1 0 0 1 1 1 0 0 0 1 1 1 0]]
Accuracy: 0.8
Cost after iteration 2500: 0.088413 test cost 0.767885 train acc 0.985646 test acc 0.800000

Take a look at the last 5 samples just as one place to focus. Notice that the 5th one from the end is incorrectly predicted as 1, but the label is 0. I tracked through the iterations and the \hat{y} value for that particular entry keeps getting more wrong, which will drive the cost up for that one entry anyway. Also notice the 4th entry from the end. The label on that is 1 and the prediction is also 1, but the \hat{y} value is only 0.55. That one flips between 2800 and 2900 iterations:

yhat values: [[0.99909132 0.99641252 0.9549542  0.78425455 0.9784453  0.67138463
  0.69604724 0.99943497 0.98738797 0.98662602 0.99777275 0.50038284
  0.99350248 0.99989094 0.06267472 0.9979228  0.1636094  0.99460847
  0.74360261 0.07956066 0.99105395 0.1054676  0.06267472 0.80251614
  0.92935317 0.99350222 0.96601926 0.06267472 0.19181458 0.99776942
  0.9459944  0.99994248 0.9957822  0.97651009 0.81716022 0.06267472
  0.06685906 0.79409662 0.89751636 0.06267472 0.9621361  0.96900758
  0.71714597 0.06267472 0.96347886 0.99868076 0.44563754 0.99999689
  0.43177929 0.08341039]]
predictions: [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1 0
  0 1 1 0 1 1 1 0 1 1 0 1 0 0]]
true labels: [[1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 0 0
  0 1 0 0 1 1 1 0 0 0 1 1 1 0]]
Accuracy: 0.78
Cost after iteration 2900: 0.075444 test cost 0.793889 train acc 0.985646 test acc 0.780000

Notice that the second to last sample also flipped to being wrong between 2500 and 2900.

Next I’d like to refine the analysis a bit to separately calculate the cost values on the correctly predicted samples and the incorrectly predicted samples and see if we can discern any patterns there, but that will have to wait until tomorrow …
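
In case anyone wants to try that too, here is a rough sketch of how I plan to do the split (a hypothetical helper, just re-deriving the per-sample cross-entropy and masking it by whether the thresholded prediction matches the label):

def cost_by_correctness(AL, Y):
    # per-sample cross-entropy: the same quantity compute_cost averages over all samples
    per_sample = -(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    correct = ((AL > 0.5).astype(int) == Y)   # boolean mask of correctly predicted samples
    return np.mean(per_sample[correct]), np.mean(per_sample[~correct])

# e.g. cost_correct, cost_wrong = cost_by_correctness(test_AL, test_y)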

Thanks, Paul, for looking into the matter and for the thorough analysis. I would not have been able to do such an analysis by myself. Thanks for the help.

I was more concerned about the overfitting part. Clearly, as the number of iterations increases, the training cost will decrease or stay the same and the test set cost will increase, and ideally the accuracy should decrease after a certain number of iterations because of overfitting on the training set.

So I was thinking along the lines of the “sweet spot” that our course instructor Andrew discusses in the next course, where we should stop training when the test set cost starts increasing.

  1. Does that mean we can say that for this training set, 750 iterations are enough (with 80% accuracy), or should we go ahead and look at more iterations?

  2. Or, since you pointed out that the test set is highly curated and also quite small, should we include more test cases before reaching a conclusion? Your thoughts?

In the case of concrete data, meaning numerical values, one can easily plot the points and draw decision boundaries.

In the case of pictures, what methodologies can we use? Any suggestions?
Is individual image analysis the only way?

Thanks

I think your description of finding the sweet spot as Prof Ng describes it is a good way to go in general. Plot the accuracy during the training on both the training and test data and see if you can discern the best point at which to stop the training.

As mentioned above, this particular example is too small and specialized to really draw very many general conclusions. E.g. I’m not sure whether training for too many iterations reliably causes the cost to go up and the accuracy to go down on the test data. It would seem intuitive that there are more realistic cases where the test results get better monotonically, but plateau at too low a level.

To address overfitting in a more general case, there would be other more sophisticated things to try than simply “early stopping”. It sounds like you’ve already done at least part of Course 2, so you’ve probably seen the sections in Week 1 about bias and variance and how to deal with them. Regularization, changing the model architecture and so forth. In Course 3, he will also discuss how to apply Error Analysis to figure out how to get better results.
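
Just to make the regularization option concrete, here is a hedged sketch of L2 regularization added on top of the existing compute_cost. The lambd argument and the "W1" … "WL" keys in the parameters dictionary follow the course’s conventions, and note that the backward pass also needs the corresponding (lambd / m) * W term added to each dW, which is covered in Course 2 Week 1:

def compute_cost_with_L2(AL, Y, parameters, lambd):
    m = Y.shape[1]
    cross_entropy = compute_cost(AL, Y)          # the existing unregularized cost
    L = len(parameters) // 2                     # parameters holds W1..WL and b1..bL
    l2 = sum(np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return cross_entropy + (lambd / (2 * m)) * l2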

I haven’t gotten around to digging deeper on the behavior in our very particular case here. As mentioned above, there’s probably not that much more to learn that is generally applicable. It’s just an opportunity to pursue some curious behavior. If I find anything else interesting, I’ll let you know.

Thanks for starting the discussion!

Thanks, Paul. In the case of pictures, how can we visualize the overfitting? Any ideas? I mean, in the case of plottable numbers, we can make a scatter plot and then use it to draw decision boundaries.

The definition of overfitting is that the accuracy is higher on the training data than it is on the test data. I’m not sure what you mean by “pictures” in this case. If you mean graphs, I think your previous graphs of the accuracy values are a great example of how to visually show overfitting (the plots diverge).
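
One small addition that makes the divergence even easier to see is to label the two accuracy curves and add a legend and axis names (a sketch using the variables, trains and tests lists from your original code):

plt.plot(variables, trains, label="train accuracy")
plt.plot(variables, tests, label="test accuracy")
plt.xlabel("number of iterations")
plt.ylabel("accuracy")
plt.legend()
plt.show()  # the widening gap between the two curves is the overfitting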

Thanks, Paul. I appreciate your help in clearing up these doubts.