Suppose we are given a dataset of house prices and sizes. Then we are given a new house size and asked to predict the price. One way we can do this is with linear regression. But I have learnt (elsewhere) that more complex regressions exist, e.g. polynomial regression. How do we decide whether we should fit a line to our data or a polynomial curve?
My own hunch is that one way would be to have some theoretical reason. If we had a theoretical reason to believe that the relationship between house size and price should be linear, that would be sufficient reason to choose a linear model. In this case it is hard to think of such a theoretical model, but in the natural sciences one could justify a linear model on theoretical grounds. Is this a good way to select a model? And, when such reasoning is available, is it the best way to select a model?
Feature selection/engineering is an art. There are guidelines and recommendations, but (at least so far) there is no reliable way to find the best model for every case. It is usually up to the AI engineer to try different combinations and choose the one that works best.
As an AI engineer, it is a good idea to have a working hypothesis about the type of relationship (linear vs. polynomial) between a feature and the output when feature engineering.
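To make "try different combinations" concrete, here is a minimal sketch that fits both a line and a cubic to the same (entirely made-up) size/price data. It also shows why training error alone cannot settle the choice: the higher-degree fit can never do worse on the data it was fit to.

```python
import numpy as np

# Hypothetical data: house sizes (m^2) and prices (in thousands).
# Both arrays are made up for illustration.
sizes = np.array([50, 70, 90, 110, 130, 150], dtype=float)
prices = np.array([150, 200, 240, 275, 305, 330], dtype=float)

# Fit a line (degree 1) and a cubic (degree 3) to the same data.
linear = np.polyfit(sizes, prices, deg=1)
cubic = np.polyfit(sizes, prices, deg=3)

# Training error shrinks (or stays equal) as the degree grows,
# so it alone cannot tell us which model to pick.
for name, coeffs in [("linear", linear), ("cubic", cubic)]:
    pred = np.polyval(coeffs, sizes)
    mse = np.mean((prices - pred) ** 2)
    print(name, round(mse, 3))
```

The cubic will report the lower (or equal) training MSE here by construction; deciding between the two models requires held-out data or cross-validation, not the training fit.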
What I think is that you should first study the relationships among the features of the data. This includes examining how they are correlated with each other, creating graphs to visualize them, and using other techniques to determine the nature of each relationship (linear, polynomial, etc.). This will help you identify the relevant features and decide on an appropriate type of model.
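A minimal sketch of that first step, assuming a pandas DataFrame with made-up columns (`size`, `bedrooms`, `price`): compute pairwise correlations, then plot each feature against the target to see whether the relationship looks straight or curved.

```python
import pandas as pd

# Hypothetical dataset; values and column names are invented for illustration.
df = pd.DataFrame({
    "size": [50, 70, 90, 110, 130],
    "bedrooms": [1, 2, 2, 3, 4],
    "price": [150, 200, 240, 275, 305],
})

# Pairwise Pearson correlations between all numeric columns.
print(df.corr())

# A scatter plot of each feature against the target helps reveal whether
# the relationship looks linear or curved (requires matplotlib):
# df.plot.scatter(x="size", y="price")
```

Note that Pearson correlation only measures linear association; a strongly curved relationship can still show a high correlation, which is why the scatter plots matter as much as the numbers.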
Data Exploration: Plot the relationship between size and price to spot possible trends.
Model Comparison: Evaluate models with metrics such as R-squared and mean squared error. Cross-validation is essential for assessing generalizability.
Model Complexity: Strike a balance between model fit and simplicity. Overly complicated models may overfit and perform poorly on new data.
Together, these steps help you find the model that best predicts house prices from size, giving you both accuracy and simplicity.
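The steps above can be sketched with scikit-learn: fit polynomial models of increasing degree and compare their cross-validated mean squared error. The data here is synthetic (a roughly linear size/price relationship with noise), so the numbers are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: house sizes in m^2 and prices that are roughly
# linear in size plus noise. Entirely made up for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(40, 200, size=(60, 1))
y = 1.5 * X[:, 0] + rng.normal(0, 10, size=60)

# Compare polynomial degrees by cross-validated MSE; the degree with
# the lowest held-out error balances fit against complexity.
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(degree, round(-scores.mean(), 2))
```

On data generated this way, degree 1 will typically have the lowest cross-validated error, matching the intuition that extra polynomial terms only fit the noise.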
Suppose (for the sake of simplicity) that you try to fit a univariate model. You would have only one predictor, so there would be no benefit in looking at correlations between variables. How would you decide whether to use a linear or a polynomial model in such a case?