Some Experiments with the Cat Recognition Assignment (C1W4A2)

Students who try the “Test with Your Own Image” section of the C1W4 Application Assignment frequently find that even the 4 layer model we trained doesn’t do very well on their own uploaded images, even though it has 80% accuracy on the test set here. It turns out that the datasets we have here are quite small compared to the sizes required to get good “generalizable” performance on an image recognition task like this. As a comparison, the Kaggle “Cats and Dogs” dataset has 25k images. It’s clear that the limitations of the online environment here required the course authors to come up with pretty small datasets, so it occurred to me to flip the question around: how did they get such good performance with such a small dataset? Is there something special about the dataset they are using here that allows relatively good performance with so few input samples?

The first step is to do a little error analysis on the results. For all the experiments here, I increased the number of iterations to 3000, but used the same 4 layer network and the learning rate of 0.0075 that they used for the “official” results. Here’s the code for that run with the original dataset, with a little extra code added to compute the numbers of false positives and false negatives on the test set:

layers_dims = [12288, 20, 7, 5, 1] #  4-layer model
parameters, costs = L_layer_model(train_x, train_y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost = True)
pred_train = predict(train_x, train_y, parameters)
pred_test = predict(test_x, test_y, parameters)
print(f"pred_test error count = {np.sum(test_y != pred_test)}")
print(f"pred_test false negatives = {np.sum(pred_test[test_y == 1] == 0)}")
print(f"pred_test false positives = {np.sum(pred_test[test_y == 0] == 1)}")
print_mislabeled_images(classes, test_x, test_y, pred_test)

Running that gives this result:

Accuracy: 0.9904306220095691
Accuracy: 0.8200000000000001
pred_test error count = 9
pred_test false negatives = 2
pred_test false positives = 7

So you can see that most of the errors on the test set are false positives, meaning that the model seems to be a bit “yes happy”.
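
If you want all four counts in one place, here is a minimal sketch of a helper that does the same kind of counting. The function name `confusion_counts` and the tiny label arrays are my own illustration, not part of the assignment; it only assumes the labels are 0/1 arrays shaped (1, m) like `test_y` and `pred_test`.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for 0/1 label arrays of the same shape."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # cats called cats
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # non-cats called cats
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # cats called non-cats
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # non-cats called non-cats
    return tp, fp, fn, tn

# Made-up labels, only to show the call; not data from the assignment.
demo_true = np.array([[1, 1, 1, 0, 0, 0, 0, 1]])
demo_pred = np.array([[1, 0, 1, 1, 0, 0, 1, 1]])
tp, fp, fn, tn = confusion_counts(demo_true, demo_pred)
print(f"tp = {tp}, fp = {fp}, fn = {fn}, tn = {tn}")
```

In the notebook the call would be `confusion_counts(test_y, pred_test)`, and `fp + fn` should match the error count printed above.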

The next thing to look at is the balance of “cat” (yes) samples versus “non-cat” (no) samples in the two datasets. We already know that the training set has 209 samples and the test set has 50 samples. Let’s see how many of each are “true” samples:

print(f"sum(train_y) = {np.sum(train_y)}")
print(f"sum(test_y) = {np.sum(test_y)}")
print(f"train positive sample ratio {np.sum(train_y)/train_y.shape[1]}")
print(f"test positive sample ratio {np.sum(test_y)/test_y.shape[1]}")

Running that gives this result:

sum(train_y) = 72
sum(test_y) = 33
train positive sample ratio 0.3444976076555024
test positive sample ratio 0.66

Interesting! The training set has only 34% cats, but the test set is 66% cats, which makes things seem a bit “unbalanced”. But maybe that’s a good strategy if they know that the learned model is “yes happy”. So the next question is whether that imbalance is important or not. One way to experiment with that would be to trade positive samples from the test set for negative samples from the training set to make the two look a bit more similar. Unfortunately, because the test set is so much smaller, we can’t get the training set to 50/50 without completely depleting the positive examples in the test set. The reason for trading entries rather than just moving them is to control the number of variables that we are changing in this scientific experiment. If we increased the size of the training set, we couldn’t be sure whether it was the balance change or the size change that made the difference.
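
To put numbers on that, here is a small sketch of the arithmetic, using only the sample counts printed above. The helper name `ratios_after_trade` is just for illustration.

```python
import math

# Counts taken from the output above
train_pos, train_total = 72, 209
test_pos, test_total = 33, 50

def ratios_after_trade(k):
    """Positive ratios (train, test) after trading k test positives
    for k training negatives. Set sizes stay the same."""
    return (train_pos + k) / train_total, (test_pos - k) / test_total

# Smallest number of trades that gets the training set to at least 50% cats
needed = math.ceil(train_total / 2 - train_pos)

for k in (0, 4, needed):
    train_ratio, test_ratio = ratios_after_trade(k)
    print(f"k = {k:2d}: train = {train_ratio:.3f}, test = {test_ratio:.3f}")
```

The value of `needed` works out to 33, which is every positive sample in the test set. Trading only 4 moves the training set from about 34% to about 36% cats and the test set from 66% to 58%.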

Here’s a block of code to trade numTrade positive images from the test set for the same number of negative images from the training set:

print(f"Starting positive samples: train = {np.sum(train_y)}, test = {np.sum(test_y)}")
numTrade = 4
test_x_pos = test_x[:,np.squeeze(test_y == 1)]
test_x_neg = test_x[:,np.squeeze(test_y == 0)]
# Permute the positive samples randomly before we pick the ones to trade
perm = np.squeeze(np.random.permutation(test_x_pos.shape[1]))
test_x_pos_perm = test_x_pos[:,perm]

test_x_pos_trade = test_x_pos_perm[:,0:numTrade]
test_x_pos_keep = test_x_pos_perm[:,numTrade:]
test_y_pos_trade = np.ones((1,test_x_pos_trade.shape[1]), dtype = 'int64')
test_y_pos_keep = np.ones((1,test_x_pos_keep.shape[1]), dtype = 'int64')
test_y_neg = np.zeros((1,test_x_neg.shape[1]), dtype = 'int64')

print_these_images(classes, test_x_pos_trade, test_y_pos_trade)

train_x_pos = train_x[:,np.squeeze(train_y == 1)]
train_x_neg = train_x[:,np.squeeze(train_y == 0)]
# Permute the negative samples randomly before we pick the ones to trade
perm = np.squeeze(np.random.permutation(train_x_neg.shape[1]))
train_x_neg_perm = train_x_neg[:,perm]

train_x_neg_trade = train_x_neg_perm[:,0:numTrade]
train_x_neg_keep = train_x_neg_perm[:,numTrade:]
train_y_neg_trade = np.zeros((1,train_x_neg_trade.shape[1]), dtype = 'int64')
train_y_neg_keep = np.zeros((1,train_x_neg_keep.shape[1]), dtype = 'int64')
train_y_pos = np.ones((1,train_x_pos.shape[1]), dtype = 'int64')

print_these_images(classes, train_x_neg_trade, train_y_neg_trade)

bal_train_x = np.concatenate((train_x_pos, test_x_pos_trade, train_x_neg_keep), axis=1)
bal_train_y = np.concatenate((train_y_pos, test_y_pos_trade, train_y_neg_keep), axis=1)
bal_test_x = np.concatenate((test_x_pos_keep, test_x_neg, train_x_neg_trade), axis=1)
bal_test_y = np.concatenate((test_y_pos_keep, test_y_neg, train_y_neg_trade), axis=1)

print(f"After rebalance positive samples: train = {np.sum(bal_train_y)}, test = {np.sum(bal_test_y)}")

That block of code references a “print images” function that I created by hacking on the print_mislabeled_images function that they provided:

def print_these_images(classes, X, y):
    """
    Plots images.
    X -- dataset
    y -- true labels
    """
    plt.rcParams['figure.figsize'] = (40.0, 40.0) # set default size of plots
    num_images = X.shape[1]
    for ii in range(num_images):
        plt.subplot(2, num_images, ii + 1)
        plt.imshow(X[:,ii].reshape(64,64,3), interpolation='nearest')
        plt.title("Class: " + classes[y[0,ii]].decode("utf-8"))

I’ll show some experiments using the above code in another post in a few minutes.

Before we start on the trading experiments, here’s the output from the training run with the unmodified dataset as shown above:

It’s a little hard to read, but if you show the image in a separate tab, you can read the titles on the mislabeled images. The two false negative cases have the cats in pretty strange positions. There doesn’t seem to be anything common to the false positives, except that a couple of them contain butterflies. I guess you could say that a butterfly’s wing is similar in shape to a cat’s ear. If you don’t look too closely :grin:

Ok, here’s the result of the first run with numTrade = 4 using this test block:

# layers_dims = [12288, 7, 1]
layers_dims = [12288, 20, 7, 5, 1] #  4-layer model
parameters, costs = L_layer_model(bal_train_x, bal_train_y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost = True)
bal_pred_train = predict(bal_train_x, bal_train_y, parameters)
bal_pred_test = predict(bal_test_x, bal_test_y, parameters)
print(f"positive samples: train = {np.sum(bal_train_y)}, test = {np.sum(bal_test_y)}")
print(f"bal_pred_test error count = {np.sum(bal_test_y != bal_pred_test)}")
print(f"bal_pred_test false negatives = {np.sum(bal_pred_test[bal_test_y == 1] == 0)}")
print(f"bal_pred_test false positives = {np.sum(bal_pred_test[bal_test_y == 0] == 1)}")
print_mislabeled_images(classes, bal_test_x, bal_test_y, bal_pred_test)

The numTrade block shows the images that are being traded. Here are the new positives for the training set:

And here are the new negatives for the test set:

Here is the result of the training with the prediction accuracy and error counts:

Cost after iteration 2700: 0.039146425121742476
Cost after iteration 2800: 0.03593977118556143
Cost after iteration 2900: 0.03331602597009896
Cost after iteration 2999: 0.03125850227344208
Accuracy: 0.9999999999999998
Accuracy: 0.8
positive samples: train = 76, test = 29
bal_pred_test error count = 10
bal_pred_test false negatives = 4
bal_pred_test false positives = 6

And here are the mislabeled images:

So this result is a little surprising: adding 4 more positive training samples ends up generating a model with 4 false negatives instead of 2. The previous 2 false negatives from the “control” case are still there, but I’d say that the two new false negatives should have been easier to recognize as cats than the others that it missed. The false positives went from 7 to 6, so the total error count went up by 1, from 9 to 10. Hmmmm.

Well, as you probably noticed, I wrote the sampling code to randomly shuffle the positive and negative samples before it selects the ones to trade. That means we will get a different set each time we try, even with the same numTrade value. Let’s try again with numTrade = 4 and see what happens. Stay tuned!
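
If you would rather be able to repeat a particular trade exactly, one option is to seed the shuffle. This is only a sketch, not the code I ran above: `trade_samples` is a hypothetical helper, it swaps columns in place instead of concatenating (so the sample order differs from my block), and the arrays below are made-up stand-ins for the real datasets.

```python
import numpy as np

def trade_samples(train_x, train_y, test_x, test_y, num_trade, seed):
    """Swap num_trade test positives with num_trade training negatives.

    x arrays are (features, m) and y arrays are (1, m), as in the assignment.
    The same seed always picks the same samples to trade.
    """
    rng = np.random.default_rng(seed)
    test_pos_idx = np.flatnonzero(test_y[0] == 1)
    train_neg_idx = np.flatnonzero(train_y[0] == 0)
    # Pick the columns to trade, without replacement
    to_train = rng.choice(test_pos_idx, size=num_trade, replace=False)
    to_test = rng.choice(train_neg_idx, size=num_trade, replace=False)

    new_train_x, new_train_y = train_x.copy(), train_y.copy()
    new_test_x, new_test_y = test_x.copy(), test_y.copy()
    # Overwrite the traded columns so both set sizes stay unchanged
    new_train_x[:, to_test] = test_x[:, to_train]
    new_train_y[:, to_test] = test_y[:, to_train]
    new_test_x[:, to_train] = train_x[:, to_test]
    new_test_y[:, to_train] = train_y[:, to_test]
    return new_train_x, new_train_y, new_test_x, new_test_y

# Tiny synthetic stand-ins for the real arrays (not assignment data)
data_rng = np.random.default_rng(0)
demo_train_x = data_rng.random((12, 10))
demo_train_y = np.array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 0]])
demo_test_x = data_rng.random((12, 6))
demo_test_y = np.array([[1, 1, 0, 1, 1, 0]])

run_a = trade_samples(demo_train_x, demo_train_y, demo_test_x, demo_test_y, 2, seed=42)
run_b = trade_samples(demo_train_x, demo_train_y, demo_test_x, demo_test_y, 2, seed=42)
print(f"positives after trade: train = {run_a[1].sum()}, test = {run_a[3].sum()}")
print(f"same trade both times: {all(np.array_equal(a, b) for a, b in zip(run_a, run_b))}")
```

Changing the seed gives a different trade, so you can still compare several trades and come back to any one of them later.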

Ok, here are the traded positives for another try with numTrade = 4:

And here are the traded negatives:

Here are the new accuracy results:

Cost after iteration 2700: 0.049674395970931415
Cost after iteration 2800: 0.04682934688220304
Cost after iteration 2900: 0.043699220183744183
Cost after iteration 2999: 0.041188403451858445
Accuracy: 0.9999999999999998
Accuracy: 0.74
positive samples: train = 76, test = 29
bal_pred_test error count = 13
bal_pred_test false negatives = 4
bal_pred_test false positives = 9

Here are the mislabeled images:

So the results are indeed different with the same number traded, and the new accuracy results are quite a bit worse. If you compare the mislabeled images with the “control” output with no changes to the dataset, you’ll see that all 7 of the false positives from the “control” run are still there, and two of the newly traded-in negatives are new false positives. That accounts for the total of 9 false positives. The false negatives make a lot less sense: one of the false negatives from “control” got traded into the training set, the other one is still a false negative, and now we have three brand new false negatives. I’m sorry, but that just does not make sense to me: we gave the training algorithm more positives to learn from, but it does worse and flips 3 images that were correctly labeled as cats before (when it had fewer positive examples to train from) to “non-cats”.

So maybe the only conclusion here is that with a dataset this small, everything is highly sensitive to the smallest change and you don’t get any smoothing benefits from statistical effects. In other words, the fact that we get numbers as good as we do in the “control” version suggests that they chose the datasets carefully. In both of these trials, perturbing the balance only made the results worse. But there’s probably not that much more to be learned here, since this is basically an unrealistic case. That’s true in the “meta” sense as well: the experiment itself is too small to support generalizable conclusions. Well, maybe the right way to state the result is that the one generalizable conclusion from all this is that small datasets are a bummer. :nerd_face:
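
For a rough sense of how noisy a 50 sample test set is, here is a back-of-the-envelope sketch using the usual binomial standard error for an accuracy estimate. It assumes the test images are independent draws, and it only covers noise from the test set size, not the effect of changing the training data.

```python
import math

def accuracy_standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(acc * (1.0 - acc) / n)

n_test = 50
control_acc = 0.82  # test accuracy of the "control" run above

se = accuracy_standard_error(control_acc, n_test)
per_image = 1.0 / n_test  # accuracy change from a single test image

print(f"one test image = {per_image:.2f} accuracy")
print(f"standard error at {control_acc:.2f} accuracy = {se:.3f}")
```

One image is worth 2 percentage points, and the standard error comes out a bit over 5 points, so the 0.80 from the first trade is within one standard error of the control and the 0.74 from the second is within about one and a half.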