Can we compare the results of the same data set with different split percentages?

I split my data into an 80% training and 20% test set, and my colleague split the same data into a 70% training and 30% test set.

Can we compare our results in terms of the coefficient of determination (R²) and RMSE on the training and test sets?

Hi there,

You want to make sure that your data, given the chosen split, is representative of your data and the real-world problem. E.g., your training set should contain all representative conditions so that your model has the chance to learn all relevant characteristics from the data.

If that applies here, I assume you and your friend would end up with comparable metrics (with respect to R²), given that all other (hyper-)parameters and boundary conditions are equal. I think it would also be useful to do a residual analysis and check the distribution.

Feel free to do this comparison and interpret or discuss your findings.
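To make such a comparison concrete, here is a minimal sketch of evaluating the same model under both split ratios. The synthetic data from `make_regression` and the `LinearRegression` model are my assumptions for illustration, not the actual data or model from this thread:

```python
# Sketch: comparing R² and RMSE for an 80/20 vs. a 70/30 split of the
# same data set. Dataset and model are illustrative stand-ins.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

results = {}
for test_size in (0.2, 0.3):  # 80/20 vs. 70/30
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=0)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    results[test_size] = (r2_score(y_te, pred),
                          mean_squared_error(y_te, pred) ** 0.5)
    print(f"test_size={test_size}: R²={results[test_size][0]:.3f}, "
          f"RMSE={results[test_size][1]:.3f}")
```

If the data is representative in both splits, the two R² values should land close together; a large gap would be a reason to dig into the residuals.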

A nice way to get familiar with different splits (not necessarily different split ratios) is cross-validation.

It can also be helpful for dealing with overfitting issues.

Feel free to take a look:

- Cross-validation (statistics) - Wikipedia
- 3.1. Cross-validation: evaluating estimator performance — scikit-learn 1.1.1 documentation
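As a quick sketch of the scikit-learn approach from the second link: `cross_val_score` evaluates the model on several different splits of the same data, so you get a spread of scores instead of a single number. The dataset and model below are assumptions for illustration:

```python
# Sketch: 5-fold cross-validation yields one R² score per fold, which is
# more robust than a single train/test split. Data/model are illustrative.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=4, noise=15.0, random_state=42)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R² per fold:", scores.round(3))
print("mean ± std:", scores.mean().round(3), "±", scores.std().round(3))
```

The standard deviation across folds gives you a feeling for how sensitive your metric is to the particular split.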

Best regards

Christian

Regarding RMSE: when comparing results, please check the length of the error vector and how it affects the result in the formula RMSE = √((1/n) Σᵢ (yᵢ − ŷᵢ)²):

E.g., a strong outlier could impact the metric more due to the quadratic influence of the error. So here it depends on whether this point was shuffled into your training or test set…

If there is a difference, you can inspect the numerator (the sum of squared errors), for example, and work out why this is the case.
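The outlier effect described above is easy to demonstrate. The residual values below are made-up numbers for illustration; MAE is included purely as a contrast, since it weights errors linearly:

```python
# Sketch: a single strong outlier inflates RMSE much more than MAE,
# because each error enters RMSE squared. Numbers are illustrative.
import numpy as np

errors = np.array([1.0, -1.0, 2.0, -2.0])   # typical residuals
with_outlier = np.append(errors, 20.0)       # one strong outlier

def rmse(e):
    return float(np.sqrt(np.mean(e ** 2)))

def mae(e):
    return float(np.mean(np.abs(e)))

print(rmse(errors), mae(errors))              # similar magnitudes
print(rmse(with_outlier), mae(with_outlier))  # RMSE jumps far more
```

So if one extreme point lands in your test set but in your colleague's training set, your test RMSE values can differ noticeably even though the models are comparable.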

If you conduct the above-mentioned residual analysis for the training and test data, I would assume you can achieve good comparability, especially with respect to the residual distribution.

Hope that helps!

Please, what does residual analysis mean?

*Residuals* are the differences between the model's predicted outputs and the labels. Thus, residuals represent the portion of the data not explained by the model.

It’s important to understand:

- how the residuals are distributed,
- whether the residuals are correlated with your features (hopefully they are not!).

Here is an exemplary residual plot showing how the model error (or model deviation) is distributed:
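A plot like that can be produced with a few lines of matplotlib. Everything below (data, model, split) is an assumption for illustration; the point is simply residuals on the y-axis against predictions on the x-axis:

```python
# Sketch of a residual plot: predicted values on the x-axis, residuals on
# the y-axis. A good fit shows residuals scattered around zero with no
# visible pattern. Data and model are illustrative stand-ins.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this to show a window
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)
residuals = y_te - pred  # label minus prediction

plt.scatter(pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.title("Residual plot (test set)")
plt.savefig("residual_plot.png")
```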