Hi all,
I applied what I learned about tree ensembles to the house price competition on Kaggle. The data set provides 80 features for predicting the house price. After some work I was able to reach around the 26th-27th percentile. There are a few things I wonder about that might improve the score even further.
I wonder if I was too aggressive in dropping features. In the write-up, I first try to detect features that are very similar, and among those I drop the ones for which the variance of the prices is higher. I did this because tree ensembles aim to minimize variance when splitting.
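Roughly, the idea looks like the sketch below. This is only a simplified version, assuming a pandas DataFrame loaded from train.csv that still contains SalePrice; detecting "very similar" via correlation, the 0.9 threshold, the quantile binning of continuous features, and the helper names are all placeholders of mine, not exactly what the notebook does:

```python
import pandas as pd

def within_group_price_variance(df: pd.DataFrame, col: str, target: str = "SalePrice") -> float:
    """Mean variance of the target within the groups defined by `col`;
    continuous columns are binned first so that grouping is meaningful."""
    values = df[col]
    if values.nunique() > 20:                      # treat as continuous
        values = pd.qcut(values, q=10, duplicates="drop")
    return df.groupby(values, observed=True)[target].var().mean()

def drop_similar_features(df: pd.DataFrame, corr_threshold: float = 0.9) -> pd.DataFrame:
    """For each pair of highly correlated numeric features, drop the one
    whose groups show the higher price variance (less useful for splits)."""
    numeric = df.select_dtypes("number").drop(columns=["SalePrice"])
    corr = numeric.corr().abs()
    to_drop = set()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > corr_threshold:
                # keep the feature with the lower within-group price variance
                worse = a if within_group_price_variance(df, a) > within_group_price_variance(df, b) else b
                to_drop.add(worse)
    return df.drop(columns=sorted(to_drop))

df = pd.read_csv("train.csv")
df = drop_similar_features(df)
```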
Then I also dropped features that are very skewed, in the sense that the data cluster around a certain range and the remaining values look like outliers. Aside from these, I guess my biggest questions are about how to handle the following features.
- Porch areas: For your convenience, I cropped one of the plots produced in the notebook.

  [cropped figure: distributions of the four porch area columns, with an NA graph on the right]

  Let us ignore the NA graph on the right for now. The total number of data points is 1460, and as we can see, most houses have little to no porch area. In the end I dropped all four porch columns and replaced them with their sum, plus the wood deck area, which gives a much better distribution. I wonder if this was too aggressive, in that it made the model too conservative.

- Bathrooms: I also replaced half baths and full baths with total baths (their sum). Again, the graph is here:

  [figure: distributions of bathroom counts]

- Areas: I dropped the basement unfinished area and just kept the total basement area (in square feet, SF). I replaced the 1st and 2nd floor areas with their sum, and I dropped the low quality finished square feet since it is very skewed. (A rough code sketch of these combinations follows this list.)
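For concreteness, here is a rough sketch of the combinations described above, assuming the standard column names from the competition's train.csv; the new column names (TotalPorchSF, TotalBath, TotalFlrSF) are just placeholders I made up:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Total porch/deck area: sum of the four porch columns plus the wood deck.
porch_cols = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]
df["TotalPorchSF"] = df[porch_cols].sum(axis=1) + df["WoodDeckSF"]

# Total bathrooms: full plus half baths combined into one count.
df["TotalBath"] = df["FullBath"] + df["HalfBath"]

# Total above-ground floor area: 1st plus 2nd floor.
df["TotalFlrSF"] = df["1stFlrSF"] + df["2ndFlrSF"]

# Drop the component columns, the unfinished basement area, and the very
# skewed low quality finished area; TotalBsmtSF and the new totals remain.
df = df.drop(columns=porch_cols + [
    "WoodDeckSF", "FullBath", "HalfBath",
    "1stFlrSF", "2ndFlrSF", "BsmtUnfSF", "LowQualFinSF",
])
```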
Let me know if there are better ways to handle these features, or how you would handle similar situations in general. Any advice is highly appreciated.
chi-yu