Error Analysis for Regressions

This course has talked about error analysis for classification, such as splitting up dev set error into categories to see which category accounts for most of the error. How would you do manual error analysis for a regression problem? It seems to me like you could only do bias/variance analysis.

Hi, Max Rivera.

Good question!

Here’s a link that should help address your question.

Let us know if it makes sense; otherwise, we can discuss it further here on the platform.

Just to clarify, the article talks about classification problems, but my question is specifically about regression. Let me know if I am missing something here.

Hello, Max Rivera.

Yes, you are right. My earlier reply explained this for classification-based problems and overlooked the fact that you had asked about regression-based error analysis.

So, let me explain this here for you. Regression analysis is the analysis that models the relationship between one dependent variable and one or more independent variables. It helps sort out which of the independent variables have a significant impact on the dependent variable, and it can be applied to linear and non-linear models as well as parametric and non-parametric models.

Six common kinds of regression-based models are:

1. Linear regression model
2. Non-linear regression model
3. Simple linear regression model
4. Polynomial regression
5. Moving average model
6. Differencing

Linear regression models identify the relationship between the dependent and independent variables.


Non-linear regression can be represented, for example, with a formula such as Y = a0 + b1*X^2.

Simple linear regression models the relationship between an independent variable X and a dependent variable Y as Y = a + b*X.

Polynomial regression can be written in the general form Y = XB + U, where Y is the dependent variable, X contains the independent variable (and its powers), B is the vector of parameters to be estimated, and U is the error term.
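To make those forms a bit more concrete, here is a minimal sketch (my own illustration, not course code) that fits a simple linear model and a degree-2 polynomial to some made-up data with NumPy:

```python
import numpy as np

# Made-up example data for illustration: a single feature X and target y
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * X + 0.3 * X**2 + rng.normal(scale=2.0, size=X.shape)

# Simple linear regression, y ≈ a + b*X
# (np.polyfit returns coefficients from the highest degree down)
b, a = np.polyfit(X, y, deg=1)

# Polynomial regression of degree 2, y ≈ b2*X^2 + b1*X + b0
poly_coeffs = np.polyfit(X, y, deg=2)

print(f"simple linear fit: a = {a:.2f}, b = {b:.2f}")
print("degree-2 fit coefficients:", np.round(poly_coeffs, 2))
```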
This link provides a detailed explanation of the calculation techniques.
In linear regression based models, the mean squared error (MSE) is most commonly used to measure the error of the model. It follows these steps (Wikipedia), as shown in the short sketch below:
a) measure the distance of the observed Y-values from the predicted values at each value of X;
b) square each of those distances;
c) take the mean of the squared distances.
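Just to illustrate those three steps, here is a tiny sketch; the arrays y_true and y_pred are placeholder values I made up:

```python
import numpy as np

# Placeholder observed values and model predictions, made up for illustration
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

residuals = y_true - y_pred      # (a) distance of observed Y-values from predictions
squared = residuals ** 2         # (b) square each distance
mse = squared.mean()             # (c) mean of the squared distances

print("MSE:", mse)               # 0.4375 for these made-up numbers
```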
People often ask which of the two, regression or classification models, is harder to get right. It depends a lot on the availability and quality of the data. In a regression problem, the model needs to predict accurately across the whole feature space in order to keep the error low and be well specified.

Hello @Max_Rivera
I’m also a new learner, and I find this question interesting. Throughout DLS Courses 1-3, most of the examples are about classification problems.

Error analysis for classification is quite simple to understand, whereas for regression, the answer above from @Rashmi is much more technical and could feel too advanced or complicated. So I want to add a few comments.

In classification error analysis on the dev set (for example, the cat/no-cat problem), we can split the errors into categories such as human tagging errors, images that are too blurry, dogs that look like cats, etc., and can easily classify them ourselves.

For a regression problem, for example predicting a house’s price based on area, number of rooms, distance, etc., how can we categorise the errors? It seems we cannot do it the same way as in classification at all.
Say we have an unusual sample with a very high price: how do we know what caused that? We would probably need to verify it against another source of truth, like a database somewhere, and figure out the reason (incorrect human input, or a system issue such as an error in data collection/processing).

My thought is that the mathematical explanation above can help identify the samples with high error, but it is not enough on its own to classify what caused them so that we can improve the quality of the dev set.
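To show what I mean in code (just my own sketch; the column names "price" and "predicted" and the cause tags are made up for illustration), you could sort the dev set by absolute error, inspect the worst examples by hand, and record a cause tag for each one, similar to the error-category counts in the classification case:

```python
import pandas as pd
from collections import Counter

# Made-up dev-set predictions; 'price' and 'predicted' are assumed column names
dev = pd.DataFrame({
    "id":        [101, 102, 103, 104, 105],
    "price":     [250_000, 310_000, 1_900_000, 410_000, 280_000],
    "predicted": [260_000, 300_000,   700_000, 500_000, 275_000],
})

# Step 1: rank dev examples by how badly the model missed them
dev["abs_error"] = (dev["price"] - dev["predicted"]).abs()
worst_first = dev.sort_values("abs_error", ascending=False)
print(worst_first.head())

# Step 2: after manually inspecting the worst rows (e.g. checking another
# source of truth), record a hypothetical cause tag per example
manual_tags = {103: "data entry error", 104: "error in data collection"}

# Step 3: count the tags, just like the error-category table in classification
print(Counter(manual_tags.values()))
```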


Hi, Sangdinh.

Good observation.

Yes, there are several other ways one can minimize error by going through certain error analysis techniques. What I mentioned in my previous reply addressed what Max Rivera had asked about, namely ‘error analysis for regression-based models’, but I missed a few of them. Here’s a good read that will give you an idea of what other techniques we can apply to minimize the error before we put a model into production.


I think your answer makes a lot of sense. Because you don’t have class labels like in a classification problem, you would just have to divide the data into whichever categories are most appropriate for your problem; in your example it was the data source. Thanks!