Density plot versus R2 value

Regarding a regression problem, my density plots are shown in the figure, with the test results on the left and the training results on the right. It seems that the model overfits the training dataset compared with the test dataset, even though the R2 values for test and training are 0.997 and 0.999, respectively, so I expected the density plots to look better than they do.
What does this mean? Is it possible, and is it evidence of overfitting?

image

Hey @IZZETTIN_ALHALIL,
Please help me understand your problem better. First, you have the following R2 results:

Training: 0.999 and Test: 0.997

Also, you have attached two plots above. I believe the blue curve represents the true density distribution of your dataset, and the red one represents the estimated density distribution. The left plot corresponds to the test dataset, and the right plot corresponds to the training dataset.

Now, if my description of your problem is correct, then I believe we can say that your model is over-fitting the dataset, based on the right plot. Your training R^2 score also suggests that your model is over-fitting the data.

Let’s try out a few things here. How about simplifying your regression model, perhaps by decreasing the degree of the model’s polynomial function? You can also try introducing regularization into your model to see whether the extent of over-fitting on the training dataset decreases.

Let us know if this helps.

Cheers,
Elemento

The blue is for the estimated values and the red is for the true values.

How can I decide that from the R2 values (0.997 and 0.999)? They are so close, and the difference is very small.

Additionally, the algorithm used is CatBoost, and I have already applied regularization with a leaf regularization value of 8.

Scatter plots of predicted vs. true values are shown for the sake of clarity.
image

I wonder why the density plots look bad compared with the scatter plots.


Hey @IZZETTIN_ALHALIL,
Thanks for the clarification. Let’s break this problem down step by step. If the red curves represent the true density curves, then I believe it is fair to say that the training and test sets in your application follow pretty much the same distribution. That is why the R^2 scores for the two sets are so similar, and it would be difficult to classify this as a case of over-fitting.

Now, in your application, if the test set is truly representative of the data that your model will see once it is deployed, then I believe it’s a good thing that your model performs this well, and you are good to go. However, if that’s not the case, then I believe you would need to modify your test set so that it is representative of the production data. Only then would we be able to judge the model properly.

Coming back to your problem, if you are in the second scenario, then I believe it would be hard to judge the extent of over-fitting by comparing any metric on the two sets (in your case, the metric is R^2). Instead, we would analyse the metric on the training set as a stand-alone metric. In your example, you are getting a training R^2 score of 0.999, i.e., the predicted values are very close to the true values. Hence, when the model sees your production data (which might differ from the training data), its performance is likely to fall, since the model is fitted very tightly to the training set.


As for the leaf regularization value, you can always experiment with it: increase it, note the effect, and then repeat the process after decreasing it. You can also try using simpler models.
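A minimal sketch of that loop, under loud assumptions: the data is synthetic, and the model is a one-feature ridge regression standing in for CatBoost, with `lam` playing the role that `l2_leaf_reg` plays there. The helper names (`fit_ridge_slope`, `sweep`, `make_data`) are made up for illustration. The point is only the procedure: vary the regularization strength and record the training and test R^2 side by side.

```python
import random

def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_ridge_slope(x, y, lam):
    # Closed-form 1-D ridge without intercept: w = sum(x*y) / (sum(x*x) + lam)
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def sweep(x_train, y_train, x_test, y_test, lams):
    # For each regularization strength, record (lam, train R^2, test R^2)
    results = []
    for lam in lams:
        w = fit_ridge_slope(x_train, y_train, lam)
        r2_train = r2_score(y_train, [w * a for a in x_train])
        r2_test = r2_score(y_test, [w * a for a in x_test])
        results.append((lam, r2_train, r2_test))
    return results

def make_data(n):
    # Synthetic data (assumption): y = 2x + small Gaussian noise
    x = [random.uniform(-1.0, 1.0) for _ in range(n)]
    y = [2.0 * a + random.gauss(0.0, 0.1) for a in x]
    return x, y

random.seed(0)
x_train, y_train = make_data(200)
x_test, y_test = make_data(50)

results = sweep(x_train, y_train, x_test, y_test, lams=[0.0, 1.0, 8.0, 64.0])
for lam, r2_train, r2_test in results:
    print(f"lam={lam:5.1f}  train R^2={r2_train:.4f}  test R^2={r2_test:.4f}")
```

With the real model, the same table of (value, training R^2, test R^2) would show whether a stronger or weaker setting narrows the gap between the two sets.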


As for the density plots seeming bad, can you please elaborate on what you mean by “bad”? To me, the density plots look good and are consistent with the metric values that you have obtained.

Cheers,
Elemento

I really appreciate your kind response.

I mean that the R2 values look good. However, the distribution plots show that the model captures the details in the training dataset, whereas for the test dataset there is a difference: the model cannot capture some details (the blue line does not fit the red line well).
At first, I assumed that the distribution plots should look similar for both training and test, but that is not the case.

Again, comparing the scatter plots I provided recently with the distribution plots,
the residuals of the test dataset seem small in the scatter plot and a bit larger in the distribution plot, and I could not figure out why.
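For reference, the two views can be put into numbers from the same pair of arrays. The sketch below is only an illustration on synthetic data (the arrays, the noise level, and the helper names `histogram_density` and `density_gap` are assumptions, not the data or code behind the plots in this thread). It computes the RMSE, which is what the scatter plot shows, and a histogram-based gap between the true and predicted distributions, which is roughly what the density plot shows.

```python
import math
import random

def rmse(y_true, y_pred):
    # Pointwise error: what a predicted-vs-true scatter plot visualises
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def histogram_density(values, lo, hi, bins):
    # Normalised histogram: bar heights integrate to 1 over [lo, hi]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # put v == hi in the last bin
        counts[idx] += 1
    n = len(values)
    return [c / (n * width) for c in counts], width

def density_gap(y_true, y_pred, bins=20):
    # Half the integrated absolute difference between the two densities:
    # 0 means identical histograms, 1 means no overlap at all
    lo = min(min(y_true), min(y_pred))
    hi = max(max(y_true), max(y_pred))
    if hi == lo:
        return 0.0
    d_true, width = histogram_density(y_true, lo, hi, bins)
    d_pred, _ = histogram_density(y_pred, lo, hi, bins)
    return 0.5 * sum(abs(a - b) for a, b in zip(d_true, d_pred)) * width

random.seed(1)
# Synthetic stand-ins (assumption): targets around 50, predictions with small noise
y_true = [random.gauss(50.0, 10.0) for _ in range(300)]
y_pred = [t + random.gauss(0.0, 0.5) for t in y_true]

gap = density_gap(y_true, y_pred)
print(f"RMSE: {rmse(y_true, y_pred):.3f}")
print(f"density gap with 20 bins: {gap:.3f}")
print(f"density gap with 100 bins: {density_gap(y_true, y_pred, bins=100):.3f}")
```

The two numbers measure different things: the RMSE is a pointwise error, while the density gap compares the shapes of the two distributions and depends on the bin count (or, for a smoothed curve, on the bandwidth). That dependence may be part of why a density plot can look worse than the scatter plot suggests.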

Hey @IZZETTIN_ALHALIL,
Apologies for the delayed response. Let me present my take on this. As per your results, the test R^2 score is smaller than the training R^2 score, which means that the estimated density curve won’t fit the true curve of the test set to the same extent as it fits the true curve of the training set.

I believe this should answer your question. The estimated density plot won’t match the true plot for the test set as closely, since the test R^2 score is smaller than the training one. However, we can’t say for certain to what extent the estimated density plot should differ from the true one based on the R^2 scores alone.
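To make the size of that gap concrete, here is a minimal sketch that uses only the two scores quoted in this thread. It reads 1 - R^2 as the fraction of target variance left unexplained. Comparing the two sets this way assumes that their targets have similar variance, which I have not verified for your data.

```python
# R^2 = 1 - SS_res / SS_tot, so 1 - R^2 is the unexplained fraction of variance.
def unexplained_fraction(r2):
    return 1.0 - r2

train_r2 = 0.999  # training score reported in this thread
test_r2 = 0.997   # test score reported in this thread

train_unexplained = unexplained_fraction(train_r2)
test_unexplained = unexplained_fraction(test_r2)
ratio = test_unexplained / train_unexplained

print(f"train unexplained fraction: {train_unexplained:.3f}")
print(f"test unexplained fraction:  {test_unexplained:.3f}")
print(f"ratio test/train:           {ratio:.1f}")
```

So although the scores differ by only 0.002, the unexplained fraction on the test set is about three times the one on the training set, which may be enough to make the test curves look visibly less aligned.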

In the end, if your model is performing well on the test set, and you are satisfied with the model’s performance, I don’t think we should wonder too much about this. I hope this helps.

Cheers,
Elemento