Can data augmentation apply to regression problem other than computer vision or time series problems?

For example, in house price prediction the model might suffer from a high variance problem, and introducing more data could potentially solve it. Considering the limited number of rows of house data, can data augmentation be applied to such a problem to generate synthetic records that can be added to the training set?

I do not think it is a good idea.

I always like to make hypotheticals like this a little more concrete. Suppose a certain zip code is under-represented in your data set. You might be inclined to treat it as a class-imbalance problem and add records with that zip code. But what values would you choose? If you understood the data well enough to create good training records from thin air, you wouldn’t be in the middle of building a regression model to predict them… you would already know what the price should be. You might say, well, I can just add noise to the existing records, or apply some statistically derived value. But again, to make useful records you really need to know the distribution, and if you have few enough records that you are thinking about augmentation, then you probably don’t have a good enough handle on that distribution either. It seems likely to me that you would be hurting your model, not helping it.
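To make the "just add noise" idea concrete, here is a minimal sketch of what that naive augmentation looks like. The feature names, values, and noise scale are all made up for illustration; the point is that the copied labels are an unverified guess:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: rows are [sqft, bedrooms], y is price.
X = np.array([[1500.0, 3], [2100.0, 4], [900.0, 2]])
y = np.array([300_000.0, 450_000.0, 180_000.0])

# Naive augmentation: jitter features with Gaussian noise and keep
# the original label. The noise scale (1% of each feature's std) is
# an arbitrary choice -- and that's exactly the problem: without
# knowing the true local price surface, we can't tell whether the
# original label is still correct for the jittered record.
noise = rng.normal(0.0, 0.01, size=X.shape) * X.std(axis=0)
X_aug = np.vstack([X, X + noise])
y_aug = np.concatenate([y, y])  # labels copied unchanged

print(X_aug.shape, y_aug.shape)  # (6, 2) (6,)
```

The mechanics are trivial; the hard part, as noted above, is that nothing in this procedure checks whether the copied label is right for the perturbed record.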

One convenient thing about augmentation in computer vision is that we know very well what the true label of an augmented sample should be. For example, we flip an image of a cat and we know it is still an image of a cat. However, this may not be true in the house price problem, because unlike labeling a new augmented image, we humans may not be good at labeling the price of a hypothetical house we generate. Any wrong labels, as @ai_curious said, can hurt the model trained on them.

Therefore, while augmentation is pretty straightforward in computer vision, it is far less trivial in problems like house price prediction. Having said that, it is not entirely impossible for such problems, but it involves more advanced subjects like MCMC (not covered in these courses).

If you are looking for a convenient data augmentation approach for problems like house price prediction, I don’t think there is anything as promising and as simple as the methods we have for computer vision. However, if you are looking for a research topic on a specific data set for some objective of your interest, keywords like “MCMC data augmentation” should be a good starting point. Do read more about how people use it, and be aware of its limitations.
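Just to give a flavour of the MCMC direction: the sketch below draws synthetic price values from a density fitted to observed data using a random-walk Metropolis-Hastings sampler. This is purely illustrative — the prices are invented, the fitted density is a crude Gaussian, and real MCMC data augmentation schemes are considerably more involved:

```python
import numpy as np

rng = np.random.default_rng(1)

# Observed (hypothetical) prices; we fit a crude Gaussian density.
prices = np.array([300_000.0, 450_000.0, 180_000.0, 390_000.0, 260_000.0])
mu, sigma = prices.mean(), prices.std()

def log_density(x):
    # Log of the fitted Gaussian, up to an additive constant.
    return -0.5 * ((x - mu) / sigma) ** 2

# Random-walk Metropolis-Hastings: propose a step, accept or reject
# based on the density ratio. The chain's draws approximate samples
# from the fitted density.
samples = []
x = mu
for _ in range(5_000):
    proposal = x + rng.normal(0.0, sigma / 2)
    if np.log(rng.uniform()) < log_density(proposal) - log_density(x):
        x = proposal
    samples.append(x)

synthetic = np.array(samples[1_000:])  # drop burn-in
print(len(synthetic), synthetic.mean())
```

Note the circularity Raymond describes still applies: the synthetic draws are only as good as the density you fit, and with few records that fit is shaky.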

Cheers,
Raymond

Totally agree. I figured there could be some path (perhaps more advanced) that could systematically generate more data, which could potentially be an option for solving the high variance problem. As @rmwkwok mentioned, MCMC could be something to explore.

Thanks for the advice! This definitely helps!

For sure, feature engineering, adjusting regularization, and other methods might be the first things to try.


There is a house price data set for King County, Washington, US (which includes Seattle) that gazillions of people have published analyses of.

https://www.google.com/search?q=kings+county+house+price+data+regression&rlz=1C9BKJA_enUS888US888&oq=kings+county+house+price+data+regression

Lots of examples of exploratory data analysis, synthetic feature creation, advanced normalization, different model architectures…

Regarding augmentation specifically, you might also do some research on SMOTE (Synthetic Minority Over-sampling Technique).
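Worth noting that standard SMOTE targets classification (it over-samples a minority class); for regression, people use adaptations such as SMOTER or SMOGN that also interpolate the target. Here is a minimal sketch of that interpolation idea — the data and helper name are made up, and this is not the full SMOTER/SMOGN algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical under-represented region of the data:
# rows are [sqft, bedrooms], y is price.
X = np.array([[900.0, 2], [1100.0, 2], [1000.0, 3]])
y = np.array([180_000.0, 210_000.0, 205_000.0])

def smote_like(X, y, n_new):
    # SMOTE-style interpolation adapted for regression: pick a
    # record and its nearest neighbour, then interpolate both the
    # features AND the target at a random point between them.
    new_X, new_y = [], []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # Nearest neighbour by Euclidean distance (excluding self).
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        j = d.argmin()
        t = rng.uniform()
        new_X.append(X[i] + t * (X[j] - X[i]))
        new_y.append(y[i] + t * (y[j] - y[i]))
    return np.array(new_X), np.array(new_y)

X_new, y_new = smote_like(X, y, n_new=4)
print(X_new.shape, y_new.shape)  # (4, 2) (4,)
```

Interpolating the target assumes the price surface is roughly linear between neighbours — which circles back to the earlier caveat that you need to trust your assumptions about the local distribution.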

Note that rearranging existing data, like rotating and shifting, is very different from inventing entirely new data.