Test Accuracy Higher than Train Accuracy?

Hi! Has anyone ever faced the issue of test accuracy being higher than training accuracy?


Sometimes it is due to having too few examples, so that you have statistical variance between the training and test sets.
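To get a feel for how much a small test set can move the number, here is a rough sketch using the normal approximation to the binomial. The function name `accuracy_interval` and the 85% figure are made up for illustration, and the approximation assumes the test examples are independent.

```python
import math

def accuracy_interval(accuracy, n_samples, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_samples examples.

    Uses the normal approximation to the binomial, so it is only a rough guide
    and assumes the examples are independent of each other.
    """
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    low = max(0.0, accuracy - z * std_err)
    high = min(1.0, accuracy + z * std_err)
    return low, high

# The same measured accuracy is far less certain on a small test set.
for n in (100, 1000, 10000):
    low, high = accuracy_interval(0.85, n)
    print(f"n={n:>5}: measured 0.85, plausible range {low:.3f} to {high:.3f}")
```

If the training and test accuracies fall inside each other's ranges, the gap may be nothing more than sampling noise.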

Or it could be any of a hundred other reasons. It really depends on the situation.

A slightly higher test accuracy than training accuracy is not always a problem with the model if the difference is marginal. But if your test accuracy is far higher than your training accuracy, something is wrong: the model is optimized on the training set, so test accuracy should not normally exceed training accuracy. This can happen:

  1. When the test set does not come from the same source dataset. A proper train/test split gives both sets the same underlying distribution; most likely a completely different (and more agreeable) dataset was provided as the test set.
  2. When an unreasonably high degree of regularisation is applied. Some elements of the test data distribution might also be different for this behaviour to occur.

Have a look at the attached pictures for a better understanding.

Keep Learning!!!


Thank you for the response. I am working with neuroimages of the brain.
I got 6 .nii files for each subject from the SPM 12 tools after normalization. I then converted them into images and saved them. Next, I used a random command to mix up the data of all subjects for each class and split it into 25% unlabeled and 75% labeled data. The labeled data is further divided into train, val, and test sets with ratios of 80%, 10%, and 10%. I am using a contrastive learning technique. I got 81% training and validation accuracy, but the test accuracy was 89%.
These are the results from the PyTorch Lightning Trainer module.
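One thing worth checking with a pipeline like this: if images are shuffled before splitting, slices from the same subject can land in train, val, and test at the same time. A minimal sketch of splitting by subject instead is below. The tuple layout `(subject_id, label, path)`, the function name `split_by_subject`, and the toy file names are assumptions for illustration, not the actual code used here, and the sketch does not stratify by class.

```python
import random

def split_by_subject(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split samples so that every subject ends up in exactly one split.

    samples: list of (subject_id, label, path) tuples (hypothetical layout).
    ratios:  fractions of subjects (not images) for train, val, and test.
    """
    subjects = sorted({subject_id for subject_id, _, _ in samples})
    random.Random(seed).shuffle(subjects)
    n_train = int(ratios[0] * len(subjects))
    n_val = int(ratios[1] * len(subjects))

    # Assign whole subjects to a split; remaining subjects go to test.
    split_of = {}
    for i, subject_id in enumerate(subjects):
        if i < n_train:
            split_of[subject_id] = "train"
        elif i < n_train + n_val:
            split_of[subject_id] = "val"
        else:
            split_of[subject_id] = "test"

    splits = {"train": [], "val": [], "test": []}
    for sample in samples:
        splits[split_of[sample[0]]].append(sample)
    return splits

# Toy data: 10 made-up subjects with 6 image slices each.
samples = [
    (f"sub{s:02d}", s % 5, f"sub{s:02d}_slice{k}.png")
    for s in range(10)
    for k in range(6)
]
splits = split_by_subject(samples)
for name, rows in splits.items():
    n_subjects = len({row[0] for row in rows})
    print(f"{name}: {len(rows)} images from {n_subjects} subjects")
```

With only 35 subjects the test split holds very few people, so the accuracy measured on it will still be noisy, but it is no longer measured on subjects the model has already seen.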

I always save the model weights after training and load them in a separate file for a second round of testing, to check whether the test accuracy is consistent. The strange thing is the following:

  1. If I load the test data and shuffle it, the accuracy is around 94%.
  2. If I load the test data and don’t shuffle it, the accuracy is around 78%.

Those are the issues I am facing. If anything is unclear, please let me know and I will share more details; I need to understand this issue and resolve it.
Kindly help me with your experience.
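For a model that scores each example independently, accuracy should not depend on the order of the test set. One possible cause of an order-dependent result (an assumption, not something confirmed in this thread) is a layer such as batch normalisation still using per-batch statistics at test time, which in PyTorch is what happens when `model.eval()` has not been called. The toy sketch below uses made-up 1-D features and hypothetical function names to show the effect when the unshuffled data is stored class by class.

```python
import random
from statistics import mean

def predict_batch_stats(batch):
    """Centre each batch on its own mean, as batch norm does in training mode."""
    batch_mean = mean(batch)
    return [1 if x - batch_mean > 0 else 0 for x in batch]

def predict_fixed_stats(batch, running_mean=0.0):
    """Centre on a fixed mean stored during training, as in evaluation mode."""
    return [1 if x - running_mean > 0 else 0 for x in batch]

def accuracy(features, labels, predict, batch_size=20):
    correct = 0
    for start in range(0, len(features), batch_size):
        xb = features[start:start + batch_size]
        yb = labels[start:start + batch_size]
        correct += sum(p == y for p, y in zip(predict(xb), yb))
    return correct / len(features)

rng = random.Random(0)
# Synthetic data stored class by class: class 0 near -1, class 1 near +1.
labels = [0] * 100 + [1] * 100
features = [rng.gauss(-1.0 if y == 0 else 1.0, 0.5) for y in labels]

# A shuffled copy of the same data.
order = list(range(len(labels)))
rng.shuffle(order)
shuffled_x = [features[i] for i in order]
shuffled_y = [labels[i] for i in order]

acc_batch_sorted = accuracy(features, labels, predict_batch_stats)
acc_batch_shuffled = accuracy(shuffled_x, shuffled_y, predict_batch_stats)
acc_fixed_sorted = accuracy(features, labels, predict_fixed_stats)
acc_fixed_shuffled = accuracy(shuffled_x, shuffled_y, predict_fixed_stats)

print("batch statistics, unshuffled:", acc_batch_sorted)
print("batch statistics, shuffled:  ", acc_batch_shuffled)
print("fixed statistics, unshuffled:", acc_fixed_sorted)
print("fixed statistics, shuffled:  ", acc_fixed_shuffled)
```

With fixed statistics the two orderings give the same accuracy; with per-batch statistics the unshuffled, single-class batches lose the class signal. If the real model behaves like the first case, it is worth checking that it is in evaluation mode during testing.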

Thank you.

Your answer explains the issue. You said you are using a contrastive learning technique, which allows one to train encoders on massive amounts of unlabelled data in a self-supervised setting. As you can see, you have split the data in a 25:75 ratio of unlabelled to labelled data, and your test accuracy looks better because you are using labelled data for model analysis. Did you try training the model without labelling the data and checking the accuracy?

Because you are shuffling the test data after loading, it is randomised again, which gives you better accuracy.


Actually, I am using the SimCLR architecture and, as I understand it, we first help the model learn features from unlabeled data, and once that is done we use the model for training and testing.
This way, I cannot check the model accuracy on unlabeled data.
I have done some more experiments, increasing the share of test data from 10% to 20%. This largely resolved the problem.

I have a small query on this: should the validation data be equal to the test data? Or, if they differ, will that not affect the future results of the model?

Okay, thank you. I understand that randomness is the cause of the sudden high accuracy. While testing, we don’t shuffle the data.

Hello Abdul,

As you know, validation data is used during training to give an unbiased evaluation of how your model is performing so that you can improve it, whereas test data is completely unseen data used after the model has been trained to assess its performance. So one should understand that the validation data need not be the same as the test data. But it is said that the test data should come from the same distribution as the training data.

How do you know which distribution to use based on your sampling?
You must use the t-distribution table when working on problems where the population standard deviation (σ) is not known and the sample size is small (n < 30).
Principle:

  1. If σ is not known, then using the t-distribution is correct.
  2. If σ is known, then using the normal distribution is correct, but make sure the training data and test data come from the same distribution.

If you have followed all these criteria and are still getting varying model performance, then look at other aspects of the model, such as batch normalisation, regularization, gradient descent, learning rate, etc.


Thank you for the guidance. Can you explain the distribution check in some more detail? I am using a neuroimage dataset of 5 classes and 35 subjects. I don’t understand what is meant by checking the population standard deviation (σ). Also, what do the terms sampling and sample size refer to in my dataset?

May I know what these 5 classes and 35 subjects are? Also, on what basis are you training your model?

The classes are the Alzheimer’s disease (AD) stages, and subjects means patients. The dataset has approximately 5k images for each class. I am training my model to classify the stages of AD.

The subjects are your sample; or, to be more specific in your case, the samples are the patients’ images that you are using to classify the patients. I am asking these questions so that I can explain how standard deviation and distribution apply to your dataset.

From what I understand, you have some neuroimages containing features on the basis of which you are going to classify these patients into subcategories. Those features of the sampled dataset will eventually produce your findings, and you will see a pattern in the analysis. Suppose that in that analysis you see some relationship between, say, the age factor and your main feature (which I don’t know); that relationship gives you a distribution.

Normal distribution, also known as the Gaussian distribution is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean.

So first you need to find if your data has a normal distribution.

Please refer to this link, Testing Distributions — Data Science in Practice, to check what kind of distribution your data has. Once you understand this, you will clearly understand how and why your test data accuracy worked or failed.


So I need to check the distribution of my training, test, and validation data, for example checking how images are distributed subject-wise for each class in each dataset, and whether each subject has some representation in each dataset. Am I right?
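A subject-wise check like this can be sketched in a few lines. The row layout `(subject_id, label)`, the function names, and the toy values below are hypothetical; the real splits would be filled in from the actual file lists.

```python
from collections import Counter

def subject_overlap(splits):
    """Return, for each pair of splits, the set of subject ids they share."""
    names = list(splits)
    subjects = {name: {sid for sid, _ in rows} for name, rows in splits.items()}
    return {
        (a, b): subjects[a] & subjects[b]
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

def images_per_class(splits):
    """Count how many images of each class every split contains."""
    return {name: Counter(label for _, label in rows) for name, rows in splits.items()}

# Toy example with made-up subject ids and class names.
splits = {
    "train": [("s1", "stage_a"), ("s2", "stage_a"), ("s3", "stage_b")],
    "val": [("s4", "stage_b")],
    "test": [("s2", "stage_a"), ("s5", "stage_b")],
}

overlap = subject_overlap(splits)
counts = images_per_class(splits)
for pair, shared in overlap.items():
    print(pair, "shared subjects:", sorted(shared))
for name, counter in counts.items():
    print(name, dict(counter))
```

Whether shared subjects are acceptable depends on the goal: if the model is meant to work on new patients, an empty overlap between train and test gives a more honest estimate.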

Your model is trained on the training data, and the test data is used to check its accuracy after training. So check the distribution of the training and validation data, and then apply the same statistical analysis to your test data.

Excellent responses and explanations, and fabulous discussion between both participants.

Can you kindly tell me where, or from what book, you got these screenshots?

Hi! I have studied the reference link. It really helped me to code and analyze my dataset. I am sharing the snips for review; please comment on them based on your experience. I have plotted the train, val, and test datasets along with the distribution curve. I also performed the normal test for train, test, and val, and the t-test first for train vs. val and then for train vs. test. All tests are performed on ages, and the plots are age vs. samples.


All the screenshots are from the DLS specialisation, and I keep them only as personal pointers while I take a course. Sometimes I choose these screenshots as references in the community as part of learning; I do not share them outside the community. These screenshots help me: whenever I am doing an assignment and have doubts about a particular topic, I go back to my folder of screenshots on that topic to clarify them.

It is just my way of learning. Happy learning!!!


@Deepti_Prasad From the results, I understand that the distribution is the same for train, test, and val. But the test accuracy is still higher than the training accuracy, so what do you think the problem is?

Hey Abdul,

That’s a great job.

Now I have some questions based on your results.

When do we reject the null hypothesis in a t-test?

If the absolute value of the t-value is greater than the critical value, you reject the null hypothesis. If the absolute value of the t-value is less than the critical value, you fail to reject the null hypothesis.
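As a small illustration of that rule, the sketch below computes Welch's two-sample t statistic with the standard library and compares it against a critical value taken from a t-table. The ages are made-up numbers, the function name `welch_t` is hypothetical, and Welch's variant is an assumption here; the thread does not say which t-test was run.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # sample variance divided by group size
    vb = variance(b) / len(b)
    t_value = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_value, dof

# Made-up ages for two splits, for illustration only.
train_ages = [70, 72, 74, 76, 78]
test_ages = [71, 73, 75, 77, 79]

t_value, dof = welch_t(train_ages, test_ages)

# Two-sided critical value at the 5% level for 8 degrees of freedom,
# read from a standard t-table.
critical_value = 2.306
reject_null = abs(t_value) > critical_value

print(f"t = {t_value:.3f}, degrees of freedom = {dof:.1f}")
print("reject the null hypothesis" if reject_null else "fail to reject the null hypothesis")
```

Failing to reject here only says the two age samples are compatible with having the same mean; it says nothing about whether the images in the two splits are equally hard for the model.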

The increase in the absolute value of the t-value signifies that the difference between the sample data and the null hypothesis is also increasing. However, the analysis doesn’t stop there. The difficulty is that the t-value is a unitless statistic and isn’t very informative by itself.

I will be honest: when you say you are testing your samples only on age, you are limiting your analysis. As a clinician, I know that neurological changes are not driven by age alone. So comparing your neuroimages on age only will again give you a superficial analysis.

My sincere suggestion is to include other relevant factors such as medical history, family history, any unhealthy habits, and mental health; you can include IQ too if you have data on it.

Your p-value is 1.0, which is far greater than 0.001 or even 0.01, and that is why the null hypothesis cannot be rejected. Failing to reject the null hypothesis means the test found no statistically significant difference.

Usually in such tests the p-value, root mean square error, and coefficient of variation matter. Honestly, all three distributions (test, train, and val) look the same because you are using the same data, which you first divided into labelled and unlabelled and then divided again within the labelled part.

The test data should follow the same distribution as the training data, but that does not mean it should be identical; since you are using the same set of images, the hypothesis does not fit here. What you could do, since you have already selected the training dataset, is to select other neuroimages that are not part of this dataset from the same place you gathered your neuroimages. Do not ask the person or authority for different data; just ask for any random neuroimages, which might or might not include your training data. I hope you understand where I am going.

Please don’t be upset by these setbacks. You seem to be really dedicated to this project and are putting in the effort. Honestly, health analysis requires a lot of statistical analysis. I suggest you get a better understanding of statistical methods before going ahead with such projects.


If you are asking in general terms, you are basically trying to find out whether x is correlated with y based on the age factor, but most health analyses, even those using only radiographic images, require other factors to be included to test the hypothesis.

@Deepti_Prasad Actually, this is my first time studying statistical methods and applying them, and I don’t fully understand them. If you have any good courses to recommend, kindly do so.

Yes, you are right. I created a dataset and split it into unlabeled, train, test, and validation sets. I also tested my model after adding a few outside images, which were not part of the dataset, to the test data. The accuracy then drops, but I do not understand why. Normally, we take a dataset, split it into train, val, and test, then train and test the model on it. I did the same in my work.

Why do you think this is happening?