How big a difference indicates overfitting?

Hi everybody,

As a rule of thumb, we know that a big difference between the validation and training error scores indicates that the model overfits the data. Yet how big is really big? If it depends, what does it depend on?
For instance, for a model that I have been working on for some time, I keep getting an RMSE of 2.1 on the training set and 3.04 on the validation set, so the validation error is about 45% higher than the training error. Does this indicate that my model overfits the data?
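For reference, the size of that gap can be checked directly. This is only a small sketch: `relative_gap` is a hypothetical helper name, not something from a library, and the two numbers are the RMSE values quoted in the question. With those values the validation error works out to roughly 45% above the training error.

```python
def relative_gap(train_error: float, val_error: float) -> float:
    """Return how much larger the validation error is, as a fraction of the training error."""
    if train_error <= 0:
        raise ValueError("train_error must be positive")
    return (val_error - train_error) / train_error


# RMSE values quoted in the question
train_rmse = 2.1
val_rmse = 3.04

gap = relative_gap(train_rmse, val_rmse)
print(f"validation RMSE is {gap:.1%} higher than training RMSE")
```

The ratio only describes the gap; whether that gap is acceptable still depends on the problem.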


P.S. To keep training time down, I had previously used a subset of the data and got an RMSE of 3.4 on the validation set. I then fed all the data to the model and got 3.04. So more data really helped, but I don't have any more data :slight_smile:

Hi @mehmet_baki_deniz! I hope you are doing well.

I use accuracy as the metric to judge whether a model is overfitting or not. As there is no one answer, I use a threshold of 5 percentage points. If training accuracy is 95% and test accuracy is 90%, I count it as a good model. But if training accuracy is 95% and test accuracy is 89%, I will try to improve model performance.

Again, there is no one answer. The nature of the problem is also a major factor in setting a threshold. However, I use that 5% threshold for my problems.


Hi Saif,
Thank you for your response.
As this is a regression problem, I use RMSE, which is not an accuracy-based metric.

But would you use MAPE for regression problems so that you can work with percentages?
I also understand that RMSE is more widely used for model evaluation.

I use the code below for my regression problems to find error and accuracy, where AL is the predicted value and Y is the actual value:

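import numpy as np

# AL: predicted values, Y: actual values (NumPy arrays of the same shape)
# Note: this divides by Y, so it is undefined wherever Y contains zeros
error = np.abs(AL - Y) / np.abs(Y)
avg_error = np.mean(error)        # mean absolute percentage error, as a fraction
percent_error = avg_error * 100
print(f"error is {percent_error}%")
accuracy = (1 - avg_error) * 100  # "accuracy" here means 100% minus the percentage error
print(f"accuracy is {accuracy}%")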
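The percentage error above is what is usually called MAPE, and it breaks down when any actual value is zero. Below is a rough sketch that reports it next to RMSE and skips zero targets. The function names `rmse` and `mape` and the sample arrays are made up for illustration, and dropping zero targets is just one possible way to handle them, not the only one.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, skipping rows where the target is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0  # dividing by a zero target is undefined
    if not nonzero.any():
        raise ValueError("MAPE is undefined when every target is zero")
    ratios = np.abs((y_pred[nonzero] - y_true[nonzero]) / y_true[nonzero])
    return float(np.mean(ratios) * 100)


# Made-up values, for illustration only
Y = np.array([10.0, 20.0, 0.0, 40.0])   # actual values
AL = np.array([12.0, 18.0, 1.0, 44.0])  # predicted values

print(f"RMSE: {rmse(Y, AL):.3f}")
print(f"MAPE: {mape(Y, AL):.1f}%")
```

Running both functions on the training set and on the validation set gives the gap in original units (RMSE) and in percent (MAPE) side by side.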

