High cost and High accuracy for multiclass classification (MNIST and CIFAR-10 Datasets)

I have built a 3-layer ANN from scratch to classify the MNIST dataset and a 5-layer ANN to classify the CIFAR-10 dataset.
Even though the accuracy of the MNIST model comes very close to 99-100%, its cost also increases during training. The same happens with the CIFAR-10 model, although its accuracy stays in the 95% range.
I want to know whether the models are overfitting to the training set and, if not, how the NN is learning to classify despite such a high loss.

NN Architecture I am using to classify MNIST:
Layer 1: 100 units; activation: ReLU
Layer 2: 60 units; activation: Sigmoid
Output Layer: 10 units; activation: Softmax
Epoch: 700
Accuracy: 98-99%
Cost function: Cross Entropy #python implementation: (-1/m) * np.sum(Y * np.log(A_L))
Cost: starts from 100 and goes up to 430

NN Architecture I am using to classify CIFAR-10:
Layer 1: 500 units; activation: Sigmoid
Layer 2: 500 units; activation: ReLU
Layer 3: 250 units; activation: ReLU
Layer 4: 250 units; activation: Sigmoid
Output Layer: 10 units; activation: Softmax
Epoch: 1000
Accuracy: 95-98%
Optimizer: Adam
Cost function: Cross Entropy #python implementation: (-1/m) * np.sum(Y * np.log(A_L))
Cost: starts from 100 and goes up to 600

Both of these models were built from scratch.

What could be going wrong in both of these NNs? Can someone help me fix the high cost I am getting with these models?

PS: I know that the cost and the accuracy are not linearly related, and that low confidence on even one example can produce a higher cost with only a small dip in accuracy.

It’s great that you are taking the course knowledge and trying to apply it to real problems. You always learn something interesting and useful when you try that.

I agree that it doesn’t make sense that the cost and the accuracy both go up. They should be moving in opposite directions. So that suggests that your gradient implementation is correct, because the accuracy goes up. But the actual way you are computing J must have some issues. E.g. are you sure you have converted the labels to “one hot” form and not the normal categorical form? You need to do that in order for that formulation of cost that you show to work correctly, right?

Thank you for your response, yes my labels are One Hot Encoded.

Did you build your own softmax implementation or use one from someplace else? Have you tried printing out your AL values to make sure they look sensible?

You are right to make the point that the relationship between accuracy and cost is not that predictable: accuracy is quantized, whereas cost is not. So in a binary classification case, if you have a sample with label 1 and the \hat{y} is 0.53 at 1000 iterations and then gets to 0.65 at 2000 iterations, the cost will go down, but the accuracy stays the same. But in your case, you are looking at the aggregate cost averaged across all the samples, so there should be a statistical relationship: higher accuracy should correspond to lower cost.
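The binary example above can be checked numerically. This is a minimal sketch, assuming the standard binary cross-entropy loss and a 0.5 decision threshold; the helper names `binary_cross_entropy` and `predict` are made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # Loss for a single example with true label y and predicted probability y_hat.
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def predict(y_hat, threshold=0.5):
    # Accuracy only sees which side of the threshold y_hat falls on.
    return int(y_hat > threshold)

y = 1
loss_1000 = binary_cross_entropy(y, 0.53)  # about 0.635
loss_2000 = binary_cross_entropy(y, 0.65)  # about 0.431

print(loss_1000, loss_2000)
print(predict(0.53), predict(0.65))  # both predictions are class 1
```

The loss drops between the two checkpoints while the predicted class, and therefore the accuracy contribution of this sample, is unchanged.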

This is what one of the prediction arrays in AL looks like:

array([0.02499799, 0.02373721, 0.0271745 , 0.18253918, 0.06547103,
0.15893156, 0.42345624, 0.06175484, 0.00841449, 0.02352295])

Since they all add to 1, we can say the softmax is working correctly.
Yes, I have implemented the softmax myself.
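For anyone repeating this check, here is a rough sketch of the sanity test applied to the array quoted above, plus a numerically stable softmax for comparison. It assumes the course-style layout in which each column of A_L is one example; the function below is an illustration, not the implementation used in this thread.

```python
import numpy as np

def softmax(Z):
    # Subtract the column-wise max before exponentiating to avoid overflow.
    Z_shifted = Z - np.max(Z, axis=0, keepdims=True)
    exp_Z = np.exp(Z_shifted)
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

# The prediction vector quoted above, treated as one column of A_L.
a = np.array([0.02499799, 0.02373721, 0.0271745, 0.18253918, 0.06547103,
              0.15893156, 0.42345624, 0.06175484, 0.00841449, 0.02352295])

sums_to_one = bool(np.isclose(a.sum(), 1.0))
predicted_class = int(np.argmax(a))  # index of the largest probability

# Large random logits (assumed test data) to exercise the stable version.
rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 4)) * 50
A = softmax(Z)

print(sums_to_one, predicted_class)
print(A.sum(axis=0))
```

Columns that sum to 1 are a necessary condition for a correct softmax, but they do not rule out problems elsewhere, such as in how the cost is computed from these values.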

Ok, that all looks good. I haven’t tried using CIFAR for anything, but I did the same thing you are describing with MNIST a few years back: took the C1 W4 A2 code and then generalized it to add softmax and trained a model on the full MNIST dataset. I was able to get similar accuracy numbers and don’t remember having any surprising behavior with the costs.

I can try to compare your code to mine and see if I can see anything. If you want to DM me the code, I’ll try to see if I can help. My day today is pretty heavily scheduled already, but I would hope to have time to look in the next 8 or 10 hours. I’m in California, so it’s UTC-7 this time of year. Just finished breakfast and now have to start doing errands.

I would highly appreciate it if you could help me debug the code.
How can I DM you the code?

As it turned out, I had a bug in my code that made the cost come out high even though everything else was implemented correctly. Thank you for your kind responses. We can close the topic for now.

It’s great news that you were able to find the issue under your own power. Keep us posted!



Hi, Hrishabh Tiwari.

Good to hear that! Could you please say what the bug in the code was, so that other learners have an idea of what to look for if they try the same experiment in the future? Thanks in advance!

I recommend trying dropout in the first 2-3 layers. It will usually lower your training set accuracy, while your test set accuracy may improve if the model is overfitting.
You can look up how to use dropout in the library you are using.
Good luck.
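Since the models in this thread were written from scratch rather than with a library, here is a minimal sketch of inverted dropout in NumPy. The function names, the `keep_prob` value, and the activation shape are assumptions chosen for illustration.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    # Keep each unit with probability keep_prob, then rescale the survivors
    # so the expected activation is unchanged (inverted dropout).
    mask = rng.random(A.shape) < keep_prob
    return (A * mask) / keep_prob, mask

def dropout_backward(dA, mask, keep_prob):
    # Apply the same mask and scaling to the gradient.
    return (dA * mask) / keep_prob

rng = np.random.default_rng(0)
A1 = np.ones((100, 5))  # stand-in for a hidden layer's activations
A1_dropped, mask = dropout_forward(A1, keep_prob=0.8, rng=rng)
dA1 = dropout_backward(np.ones_like(A1), mask, keep_prob=0.8)

print(A1_dropped.mean())  # close to 1.0 on average
```

Dropout is applied only during training; at prediction time the activations are used as they are.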

Yeah, the high cost was because I was computing the cost with the original target values rather than the categorical (one hot encoded) values.
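To illustrate how this kind of bug can inflate the cost, here is a small sketch using the cost formula from the first post. It assumes the raw labels are integers stored with shape (1, m) and that A_L has shape (10, m); the actual code in this thread may have differed. With that layout, NumPy broadcasting multiplies every class's log-probability by the label value instead of selecting only the true class.

```python
import numpy as np

def cross_entropy(Y, A_L):
    # Same formula as in the original post: (-1/m) * np.sum(Y * np.log(A_L))
    m = A_L.shape[1]
    return float((-1.0 / m) * np.sum(Y * np.log(A_L)))

def one_hot(labels, num_classes):
    # Column i has a single 1 in the row given by labels[i].
    Y = np.zeros((num_classes, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

labels = np.array([3, 7, 1])            # hypothetical raw targets
A_L = np.full((10, 3), 0.01)            # confident, correct predictions
A_L[labels, np.arange(3)] = 0.91        # each column sums to 1

cost_one_hot = cross_entropy(one_hot(labels, 10), A_L)  # about 0.09
cost_raw = cross_entropy(labels.reshape(1, -1), A_L)    # well above 100

print(cost_one_hot, cost_raw)
```

Under this assumption, the cost computed from raw labels grows as the predictions become more confident, because the probabilities of the wrong classes shrink and their logs become more negative. That would be consistent with accuracy and cost rising together.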

Thanks for letting us know what was causing an error in your case!

Mentioning the cause is good practice, so that other learners with similar questions can get an idea of what to check.