True vs. predicted values: biased intercept

Hello developers,
While applying some ML algorithms (CatBoost and ANN models, …), I am facing a problem of a biased intercept: my predicted values and true values do not fit on a regression line, as in the picture. How can I overcome this drawback? Is there a technique I can employ to prevent the algorithm from producing this behaviour?

Hi @IZZETTIN_ALHALIL

Taking a look at your plots, the results show heteroscedasticity: the variance of the predictions of the test labels for values < 0.6 is higher than the variance for values >= 0.6. In this area of smaller predictions, your model's performance on test data also seems to be worse compared to the training data, systematically predicting higher values than the ground truth for < 0.6. This tendency to predict too high is also slightly visible for the pure training data in that area.

Regarding the biased intercept: it’s hard to see, but I would assume your blue line is a line through the origin, correct?

Can you share a plot with equal axes?

My suggestions would be:

  • to take a closer look at your residual plots and check whether there are still systematic patterns in your feature residuals (especially for predictions < 0.6) which you can exploit and incorporate into your model via feature engineering, so that the model no longer systematically predicts higher values than the ground truth below 0.6
  • to think about going for a probabilistic model (like a Gaussian process) - either in general or in addition - exploiting the uncertainty of the predictions and understanding how well your feature space is covered with data. You can also visualise the confidence levels of your probabilistic model in your scatter plots. This lets you understand where your model "feels" certain and uncertain, which in turn allows conclusions (e.g. pinpointing) about potential overfitting in areas of the test data - which I would assume might be the case here for smaller predictions. A minimal sketch follows right after this list.
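To make the probabilistic suggestion concrete, here is a minimal Gaussian-process sketch using scikit-learn's GaussianProcessRegressor on toy data - the kernel choice and all numbers are illustrative assumptions, not a tuned recipe for your problem:

```python
# Minimal Gaussian-process sketch; toy 1D data stands in for your features.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(100, 1))
y_train = np.sin(4 * X_train[:, 0]) + rng.normal(0, 0.1, size=100)
X_test = np.linspace(0, 1, 50).reshape(-1, 1)

# RBF models smooth structure, WhiteKernel absorbs observation noise.
kernel = 1.0 * RBF(length_scale=0.2) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_train, y_train)

# return_std=True gives the predictive uncertainty per test point -
# a large std means the model "feels" uncertain there (sparse coverage).
y_pred, y_std = gp.predict(X_test, return_std=True)
print(y_pred[:3], y_std[:3])
```

The per-point predictive standard deviation is what you would overlay on your scatter plots to see where the model is confident and where it is not.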

If you want to tackle overfitting, you can think about regularising your model more, or slightly reducing its capacity / complexity - see the sketch below. (Also: if your feature space has too many dimensions, you could think about optimising your "data to features" ratio, e.g. via PLS or PCA, getting rid of redundancy in your features through a data transformation.)
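Since you mentioned CatBoost, a hedged sketch of what "regularising more / reducing capacity" could look like there - the parameter values and the toy data are placeholders, not tuned recommendations:

```python
# Illustrative CatBoost regularisation sketch on synthetic data.
import numpy as np
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                       # toy feature matrix
y = 0.5 * X[:, 0] + rng.normal(0, 0.1, size=500)     # toy target
X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]

model = CatBoostRegressor(
    depth=4,             # shallower trees -> lower model capacity
    l2_leaf_reg=10.0,    # stronger L2 regularisation on leaf values
    learning_rate=0.03,  # smaller steps, combined with early stopping
    iterations=2000,
    verbose=False,
)
# Early stopping on a held-out set guards against overfitting.
model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=100)
print(model.get_best_iteration())
```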

In general it could be helpful if you could explain:

  • a high-level description of your problem
  • and your modelling approach (e.g. how many feature dimensions you used and why you chose your model architecture, etc. …)

Hope that helps!

Best regards
Christian

Thanks for your valuable response, it was really beneficial for me.
The data in the previous plots was already passing through the origin; I corrected this in the original post.
I am not sure about my data’s behaviour, so I have uploaded residual plots below.

[image: scatter plot of the predicted values vs. residuals]

[image: histogram of the residuals]

[image: Q-Q plot of the residuals]

My data has 12 features and three targets, with 590 examples.
I also wonder how I can get higher R² values given this behaviour.
So far I have understood that I need to add more examples, especially to cover the range below 0.6, and that I may need to clean the outliers.


Great!

Well done, @IZZETTIN_ALHALIL!

Now you can visually check the dependency between your features and your residuals (which you have already calculated and plotted).

Here you can find a step-by-step guide in the repo on how to do this - feel free to just reuse the code.

This step will help you to identify the features that show a strong systematic dependency pattern (e.g. some correlation with your residuals). If you understand why this is the case, you can eliminate these systematics by feature engineering, incorporating more knowledge into your features, so that afterwards the residuals no longer show any systematic dependency on your features - ideally you just see random noise.
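As an illustration, here is a minimal sketch of such a residual-vs-feature check - the toy features and the deliberately flawed predictions are placeholders for your actual data and model:

```python
# Minimal residual-vs-feature check on synthetic data.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))                        # toy features
y_true = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.05, 200)
y_pred = X[:, 0] + 0.5 * X[:, 1]                      # deliberately misses the square
residuals = y_true - y_pred

# One scatter plot per feature: any visible trend (here in feature 1)
# is structure you can move into the model via feature engineering.
fig, axes = plt.subplots(1, X.shape[1], figsize=(12, 3), sharey=True)
for i, ax in enumerate(axes):
    ax.scatter(X[:, i], residuals, s=8, alpha=0.6)
    ax.axhline(0.0, color="k", lw=0.8)
    ax.set_xlabel(f"feature {i}")
axes[0].set_ylabel("residual")
plt.tight_layout()
plt.show()
```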

The residuals should be approximately normally distributed, and they should not show any patterns or correlations with your features; see also this thread.

Feel free to share your visualisations and interpretation of results. Hope that helps, @IZZETTIN_ALHALIL!

Best regards
Christian


Oh ok, what do you mean by 3 targets? I actually expected you to have a regression problem with one label that you are predicting.

I guess that out of the 590 labels you would use no more than 400 as the training set, to ensure you have a sufficient test set, right?

Think about the following:

  • if you have a feature space of one dimension, you have really good coverage with e.g. 100 labels for training
  • if you have two features (assuming, as a simplification, that your data is uniformly distributed, which is probably not the case, but anyway just as a rough comparison), you would need 100^2 = 10000 labels for the same data coverage
  • for equally good data coverage with n features, 100^n labels would be needed (again assuming a uniform distribution, which might not be realistic, but you get the point as a rough rule of thumb)

So you see that for 12 dimensions you have quite a small training set. E.g. since 2^{12} = 4096, even two training examples per dimension would already require more data than you have, as 400 < 4096 (and even 590 < 4096). In reality things might not be that critical, since data often follows not a uniform but rather a normal distribution, but qualitatively you really seem to have too little data for such a 12-dimensional feature space, depending on what model you are using for prediction purposes.
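A quick back-of-the-envelope check of this (plain arithmetic; the 400-sample training set size is taken from above):

```python
# Rough coverage check for the rule of thumb above: with n_samples spread
# uniformly over d feature dimensions, the effective number of examples
# "per axis" is the d-th root of the sample count.
n_samples, d = 400, 12
per_axis = n_samples ** (1 / d)
print(f"{per_axis:.2f} examples per dimension")  # ~1.65

# Conversely, even 2 points per axis would already need 2**d samples.
print(2 ** d)  # 4096, well above 590
```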

If you want to reduce your 12-dimensional feature space (e.g. to 4 dimensions or even fewer), you can do this e.g. by eliminating redundancy in your features (e.g. with a Partial Least Squares (PLS) or Principal Component Analysis (PCA) transformation) to get a better data-to-feature-space ratio. For this purpose, a minimal code example for PCA or PLS can be worth a look (see the sketch further below).

E.g. when you calculate the principal components, you can first check how much of the information is explained by them:
[image: explained variance per principal component]

Then you can decide to which space (e.g. 4D or so) you want to transform the data, so that most of the information in your data set is still kept after the transformation.
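A minimal PCA sketch with scikit-learn - the random feature matrix and the 95% variance threshold are illustrative assumptions, stand-ins for your real (590, 12) data and your own cutoff:

```python
# Minimal PCA dimensionality-reduction sketch.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(590, 12)  # placeholder for your real feature matrix

# PCA is scale-sensitive, so standardise the features first.
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)  # how much variance the first k components explain

# Keep the smallest number of components explaining e.g. 95% of variance.
k = int(np.searchsorted(cumulative, 0.95)) + 1
X_reduced = PCA(n_components=k).fit_transform(X_std)
print(X_reduced.shape)
```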

Hope that helps!

Best regards
Christian
