Test Accuracy Higher than Train Accuracy?

Don’t add just a few images to the test data. Let the test data come from the same pool of neuroimages you gathered, say from a hospital, laboratory, or radiology centre.

My understanding is: whatever neuroimage data you have collected, label it based on your target attributes and then split it according to the size of your dataset.
For example, if you have 10,000,000 images, split it 98% training, 1% validation, 1% test;
if you have a dataset of 10,000 images, split it 60% training, 20% validation, 20% test.
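For the 60/20/20 case, a rough two-step split with scikit-learn could look like this (X and y are placeholders for your labelled images and target attributes):

```python
# Rough sketch of a 60/20/20 split with scikit-learn.
# `X` and `y` are placeholders for the labelled images and target attributes.
from sklearn.model_selection import train_test_split

# First hold out 40% of the data, then split that portion in half
# to get the validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)
```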

Read this article: Functional Data Analysis in Brain Imaging Studies (PMC).

You will understand where you are going in your analysis.

Regards
DP

Yes, my understanding of the data distribution is the same as yours. I am doing the same, and maybe I need to train my model with more diverse data so that it can generalize better.
Thank you for sharing the article. Could you please share some good statistics courses that would help me in my studies?

Hello Abdul,

There is a statistics course from DeepLearning.AI, but I have not done it personally.
Actually I am a dentist, so we already had subjects related to probability and statistics in our college days.
Later I did some certification courses on Coursera: R Programming and Statistics with SAS. Statistics with SAS is the best course I have come across so far, though for a better understanding one needs to know SAS before taking it; since I had done SAS Programming, I was able to follow it better. These two courses are really good, as the instructors are very clear in their speech and use numerous examples to explain different analyses.

P.S. These courses are not free, but students can apply for financial aid.

Happy Learning!!!
Regards
DP

Hello Abdul,

There is another way you can split your data: cross-dataset testing. I am sharing a link that explains how to use it; please have a look. I hope you have an understanding of k-folds.
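If it helps, a minimal k-fold sketch with scikit-learn looks like this (the classifier and the X, y arrays are placeholders, so swap in your own model and data):

```python
# Minimal k-fold cross-validation sketch with scikit-learn.
# `X`, `y` and the classifier are placeholders; use your own model and data.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())  # average accuracy and its spread across folds
```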

Hope this helps to move forward!!!

Regards
DP

Hi,

I have a similar issue. I am applying Bayesian networks (Markov blanket) for feature selection and prediction on a dataset with around 68 features, all categorical, and 220 data points. Of these 220 samples, 22 have been allocated for validation and another 22 for test. From the remaining 176 training points, I have 228 data points after oversampling.

I am using the pgmpy package for the implementation, and it is tricky to use k-fold cross-validation here because training the Bayesian model requires the target node (labelled data), which means the training data (xtrain) includes the predictor (the y labels). Therefore, when testing, I performed a validation test on the validation dataset and the actual test on the test dataset.

However, the accuracy and sensitivity are evidently higher on the test dataset than on the validation set.

This is not a case of overfitting or underfitting, and I checked to ensure data has not leaked. Is this a true performance, or is something else going unnoticed here?

With only 22 examples in the validation and test sets, you’re going to have a lot of variance in the metrics. That may or may not be an issue depending on how big the differences are.

The difference in accuracy between test and validation is around 20%-25%, and the difference in sensitivity is between 15%-20% for the Bayesian networks and around 30% for the SVM models.

Are those the differences, or the actual values?
Can you state them as a table of your results?

I was referring to the difference, not the actual values.

The table above shows the accuracy, sensitivity, and AUC values for the different datasets respectively.


I would suspect your validation and test sets don’t have the same statistical distribution.

This is not unusual when both are small numbers of examples.

Cross-fold validation is the usual work-around for not having enough data.

I don’t have a lot of experience with small data sets, maybe someone else will have additional advice.

Thanks for the input, I’ll try it out and post the results here for others (if it will be helpful in any way).

Could you elaborate more on how categorical data can have the same or a similar statistical distribution, and help me understand your suggestion? I have been looking online at the chi-square test, but could not absorb much info.

Usually for numeric data we look at the standard deviation, mean, etc.
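This is roughly what I pieced together online with scipy, comparing the category counts of one feature between the validation and test sets, but I am not sure it is what you meant (val_df and test_df are placeholders for my validation and test DataFrames):

```python
# Rough attempt: compare one categorical feature's distribution between the
# validation and test sets with a chi-square test.
# `val_df` / `test_df` are placeholders for the validation and test DataFrames.
import pandas as pd
from scipy.stats import chi2_contingency

feature = "GAD6"  # any categorical column
counts = pd.DataFrame({
    "val": val_df[feature].value_counts(),
    "test": test_df[feature].value_counts(),
}).fillna(0)

chi2, p, dof, _ = chi2_contingency(counts.T)
print(f"{feature}: p-value = {p:.3f}")  # a small p-value suggests different distributions
```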

I’m really bad at statistics, maybe someone else can help with this.

No worries. Thank you so much for all the input you provided so far and responding quickly.

Hi @akshata_bm

Adding to how the other mentor has guided you in the right direction, I have some queries first before giving you any suggestions.

Can you please tell me if you chose a model with all the features, i.e. 68, as per your description?

Can I know how you selected the 22/22 division between the validation and test sets?

Did you try changing the dataset division to 140 training, 40 validation and 40 test?

Based on the image result you shared, it clearly shows that your validation sample under-represents your analysis.

To address this, you could do more randomisation of samples between the sets.

Clearly, set 3 in the image you shared seems closer in results. You also need to understand that having better accuracy but lower sensitivity is not a good outcome when it comes to feature selection for your analysis.

Points to ponder
Having a training dataset with labelled data should not stop you from using k-fold cross-validation; you just need to be careful about the variation that the labelled data is causing in your analysis. A rough sketch of what I mean is below.
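This is only a sketch, assuming a pandas DataFrame df of categorical columns with a binary target column; the simple target-to-feature structure and the feature names are placeholders, so adapt it to your pgmpy/Markov-blanket pipeline:

```python
# Rough sketch: k-fold cross-validation around a pgmpy Bayesian network.
# Assumes `df` is a pandas DataFrame of categorical columns with a binary
# "target" column; the structure and feature list below are placeholders.
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, recall_score
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import BayesianEstimator

features = ["GAD6", "feat_b", "feat_c"]    # hypothetical Markov-blanket features
edges = [("target", f) for f in features]  # simple target -> feature structure

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)
accs, sens = [], []
for train_idx, test_idx in skf.split(df, df["target"]):
    train, test = df.iloc[train_idx], df.iloc[test_idx]

    # The labelled target is part of the training fold (pgmpy needs it to fit),
    # but the held-out fold's labels are used only for scoring.
    model = BayesianNetwork(edges)
    model.fit(train[features + ["target"]],
              estimator=BayesianEstimator, prior_type="BDeu")

    preds = model.predict(test[features])
    accs.append(accuracy_score(test["target"], preds["target"]))
    sens.append(recall_score(test["target"], preds["target"],
                             pos_label=1))  # adjust pos_label to your positive class

print(sum(accs) / len(accs), sum(sens) / len(sens))
```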

Not much can be stated until we have an idea about your dataset, or what you are trying to find or state with your analysis.

Regards
DP

Hi @Deepti_Prasad

To answer your questions,

  1. No, I have not chosen the model with all 68 features. As mentioned in the image, under each Feature Set I have chosen: first, the set of features obtained by simply running the Bayesian model once, according to which only GAD6 was important; after bootstrapping and running the model multiple times, I chose the most frequently occurring Markov blanket features (set 2, consisting of 3 features); and the third set is the top 4 features that occurred in the bootstrap sampling.

  2. xtrain, xvali, ytrain, yvali = train_test_split(X, y, test_size=0.2, random_state=1234)
     xval, xtest, yval, ytest = train_test_split(xvali, yvali, test_size=0.5, random_state=5122)
     The training set was oversampled after this split.

  3. No, I did not try a division of 140 training with the remainder split between validation and test. I went as per the norm that we need more data for training the model, so the ratio was 80/20.

  4. What did you mean by doing more randomisation of samples between the sets? Something like cross-validation?

  5. The dataset is a health dataset; the samples are answers to certain questions about the patients. The aim is to build a binary classification model with good sensitivity and good accuracy. I understand one of them can be achieved to a great extent by compromising on the other, but with the results I am getting right now the gap is way too much.

  6. Also, do you suggest that I first oversample the data and then perform the train-test split? Would this leak the data in any way?
    Currently I am using sklearn’s train_test_split method with the random states mentioned above and then oversampling the training data only (rough sketch after this list).
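Rough sketch of my current order (the oversampler shown is imblearn's RandomOverSampler as an example; my actual oversampling step may differ):

```python
# Split first, then oversample only the training portion.
# imblearn's RandomOverSampler is used here as an example oversampler.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

xtrain, xvali, ytrain, yvali = train_test_split(X, y, test_size=0.2, random_state=1234)
xval, xtest, yval, ytest = train_test_split(xvali, yvali, test_size=0.5, random_state=5122)

# Oversampling touches only the training split, so no validation or test
# rows can leak into the resampled training data.
ros = RandomOverSampler(random_state=1234)
xtrain_os, ytrain_os = ros.fit_resample(xtrain, ytrain)
```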

Is GAD6 glutamic acid decarboxylase?

Can I know what health dataset you are talking about? I am a dentist, so you can share medical-related information, as I probably have a good understanding of medical analysis.

You mentioned you are doing a binary classification based on the feature GAD6, out of 68 categorical features. Can I know why you chose only one feature based on Bayesian statistics?

Chances are that's why you are seeing these wide gaps, as health or disease is not the outcome of a single entity but can have dependent factors, each to a different degree.

What you could do is select the top two or three features selected by Bayesian statistics and then check the results.

Another good way to know which feature is more related to your disease is via p-values, MSE analysis, and F1 scores; these give a more varied approach to understanding whether a feature affects the data spread alone or together with others. This is usually done using R programming for medical data analysis.

Regards
DP

This is a mental health dataset. I am not quite sure what it stands for, but I think it is Generalized Anxiety Disorder.

Yes I have selected the top 4 features given by the Bayesian Network which is the set 3 in the image.

Is the p-value the one given by the Chi-squared test?
Also, can you please address my question in point 6 in the previous post?

I cannot suggest this step as I don’t know the complete basis of your analysis.

The p-value doesn't hold significance only for the Chi-square test; a p-value holds significance in hypothesis testing generally, as a measure of how true your analysis might be or how close it would be.

Actually, I am doubting the approach of your classification;
my reasoning is that you are trying to do a binary classification of a disease or disorder with multiple categorical features!

I don't know if you have heard about MANOVA analysis, where multivariate samples are compared based on one or more dependent variables.

Here, understanding dependent and independent variables (features) is important, as that will lead you to select which type of MANOVA analysis is suitable for your data.

The one-way multivariate analysis of variance (one-way MANOVA) is used to determine whether there are any differences between independent groups on more than one continuous dependent variable, whereas a two-way MANOVA covers scenarios with two independent variables and multiple dependent variables. It is particularly useful in determining how two independent variables interact in their combined influence on several dependent variables.
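If you want to try it in Python rather than R, a minimal one-way MANOVA sketch with statsmodels would be along these lines (the file and column names are placeholders; MANOVA assumes continuous dependent variables, e.g. questionnaire scale scores rather than raw categories):

```python
# Minimal one-way MANOVA sketch with statsmodels.
# "health_data.csv" and the column names are placeholders: score1/score2 are
# continuous dependent variables and group is the independent factor.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.read_csv("health_data.csv")
maov = MANOVA.from_formula("score1 + score2 ~ group", data=df)
print(maov.mv_test())  # Wilks' lambda, Pillai's trace, etc. for each term
```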

Regards
DP

That's right. Why or how do you see that as a problem?

I am trying to apply probabilistic graphical methods (particularly Bayesian networks and the Markov blanket) for feature selection and prediction.

Based on your result or outcome, which depends on the data points being chosen at random and on the feature selection: probably each set you are selecting has a different feature responsible, leading to the varied gap, so approaching it from a broader perspective will give you more insight into your data spread.

I am not stating your approach is wrong; probably the right way to put it is that it is not closely related to your data.

Also remember that the Bayesian analysis you did for feature selection was based on the data available to you; a feature topping the list of probabilities doesn't mean it has a higher chance of producing a correct classification model. In diagnosis, even confounding factors have an effect on a disease, so is the Bayesian analysis missing that point? How would one find out? The approach would be MANOVA, or checking a different dataset with features of different probabilities.

Regards
DP