Selecting the right model for the task

Suppose that we are given a dataset of house prices and sizes, and then a new house size for which we are asked to predict the price. One way to do this is linear regression. But I have learnt (elsewhere) that more complex regressions exist, e.g. polynomial regression. How do we decide whether to fit a line to our data or a polynomial curve?

My own hunch is that one way would be to have some theoretical reason. If we had a theoretical reason to believe that the relationship between house size and price should be linear, that would be sufficient reason to choose a linear model. In this case it is hard to think of such a theory, but in the natural sciences one could often justify a linear model on theoretical grounds. Is that a good way to select a model? And when such reasoning is available, is it the best way to select a model?


Feature selection/engineering is an art. There are guidelines and recommendations, but (at least so far) there’s no reliable way to find the best model for every case. It’s usually up to the AI engineer to try different combinations and choose the one that works best.

As an AI engineer, it’s a good idea to have a “guess” about the type of relationship (linear vs. polynomial) between a feature and the output when doing feature engineering.


When you create the model, as a designer you start simple and then add more complexity, until you get results that are “good enough”.

Creating new features (via polynomial combinations of the existing features) is one way to increase the complexity.
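To make “polynomial combinations of the existing features” concrete, here is a minimal pure-Python sketch. The function name and the degree-2 house example are only illustrations, not a reference to any particular library:

```python
from itertools import combinations_with_replacement

def polynomial_features(x, degree):
    """Expand a feature vector x with all products of its entries up to
    the given degree (for [a, b] at degree 2, this adds a*a, a*b, b*b)."""
    expanded = list(x)
    for d in range(2, degree + 1):
        for combo in combinations_with_replacement(range(len(x)), d):
            term = 1.0
            for i in combo:
                term *= x[i]
            expanded.append(term)
    return expanded

# A (made-up) house with size 3.0 and 2.0 bedrooms:
print(polynomial_features([3.0, 2.0], degree=2))  # [3.0, 2.0, 9.0, 6.0, 4.0]
```

Libraries such as scikit-learn provide the same expansion ready-made, but the idea is just this: multiply existing features together to give a linear model curved terms to fit.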


As my friend (hackyon) mentioned, I think it is something of an art to choose the model that best fits your data.

But you can use grid search to tune and find the best hyperparameters for your algorithm.
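As an illustration of the idea: grid search simply tries every value in a small grid of hyperparameter settings and keeps the one with the lowest validation error. The sketch below tunes a regularization strength for a toy 1-D ridge model (the data and the closed-form fit are invented for illustration); a polynomial degree would be searched the same way:

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge regression through the origin:
    w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(xs, ys, w):
    """Mean squared error of predictions w*x against targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data: price roughly 2 * size, with a held-out validation set.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 8.0])
val   = ([1.5, 2.5, 3.5],      [3.0, 5.1, 6.9])

best_lam, best_err = None, float("inf")
for lam in (0.0, 0.1, 1.0, 10.0):      # the hyperparameter "grid"
    w = fit_ridge_1d(*train, lam)
    err = mse(*val, w)
    if err < best_err:
        best_lam, best_err = lam, err

print(best_lam)
```

Tools like scikit-learn’s GridSearchCV wrap this loop with cross-validation, but the core logic is just the exhaustive loop above.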

Consider this sentence: “there is no best model for your data.”

What I think is that you should first study the relationships among the features of the data. This includes examining how they are correlated with each other and creating graphs to visualize them, along with using many other techniques to determine their nature (linear, polynomial, etc.). This will help you determine the relevant features and decide on the appropriate type of model to use.
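For example, a quick way to quantify how linear the relationship between two variables is, is the Pearson correlation coefficient. A minimal pure-Python sketch with made-up size/price numbers:

```python
def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of
    the standard deviations. Close to +1 or -1 suggests a linear trend."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

sizes  = [60.0, 80.0, 100.0, 120.0]   # invented house sizes
prices = [150.0, 190.0, 240.0, 280.0] # invented prices
print(pearson_corr(sizes, prices))    # close to 1.0: nearly linear
```

Note that correlation only measures linear association; a strong polynomial relationship can still show a modest correlation, which is why plotting the data matters too.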



I think you should pay attention to:

Data Exploration: Plot the relationship between size and price to spot possible trends.

Model Comparison: Evaluate models with metrics such as R-squared and mean squared error. Cross-validation is very important for judging how well a model generalizes.

Type of Model: Find a good balance between model fit and simplicity. If you make your models too complicated, they may overfit and not work well on new data.

These steps can be combined to find the model that best predicts house prices from size, giving you both accuracy and simplicity.
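A minimal sketch of the two metrics mentioned above, with made-up observed and predicted prices:

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares);
    1.0 is a perfect fit, 0.0 is no better than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [200.0, 250.0, 300.0, 350.0]  # invented observed prices (thousands)
y_pred = [210.0, 240.0, 310.0, 340.0]  # invented model predictions

print(mean_squared_error(y_true, y_pred))  # 100.0
print(r_squared(y_true, y_pred))           # 0.968
```

Crucially, these should be computed on held-out data (validation folds), not on the training set, or an overfit model will look deceptively good.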

Suppose (for the sake of simplicity) that you try to fit a univariate model. With only one input variable, there is no benefit in looking at correlations between variables. How would you decide whether to use a linear or a polynomial model in such a case?

So what guidelines and recommendations are there?

Also, how would you measure which one works best? Simply by performance on the validation set?

See my previous reply on this thread.

  • Training sets are used for training.
  • Validation sets are used to tune the model’s hyperparameters (i.e. to avoid overfitting).
  • Test sets are used to verify the performance of the completed system.
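A minimal sketch of such a three-way split; the fractions and the shuffling seed are arbitrary choices for illustration:

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the records, then carve off test and validation
    subsets for the three roles described above."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

houses = list(range(100))  # stand-in for (size, price) records
train, val, test = train_val_test_split(houses)
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before splitting matters: if the data is sorted (e.g. by price), a naive slice would put systematically different houses in each subset.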