XGBoost for house price competitions

Hi all,

I applied what I learned about tree ensembles to the house-price competition on Kaggle. The dataset provides 80 features for predicting the sale price. After some work I reached roughly the 26th-27th percentile. There are a few things I wonder might improve the score even further.

I wonder if I was too aggressive in dropping features. In the write-up, I first try to detect features that are very similar and drop the ones whose price variances are higher. I did this because tree ensembles aim to minimize variance when splitting.
Then I also dropped features that are very skewed, in the sense that the data cluster around a narrow range and the remaining points look like outliers. Aside from these, I guess my biggest questions are how to handle the following features.

  1. Porch areas: For your convenience, I cropped one of the images produced in the notebook.

    Let us ignore the NA graph on the right for now. The total number of data points is 1460. As we can see, most houses have little to no porch area. In the end I dropped all four of these columns and replaced them with their sum, plus the wooden deck area, which gives a much better distribution. I wonder if this was too aggressive, in that it made the model too conservative.

  2. Bathrooms: I also replaced half baths and full baths with total baths (their sum). Again, the graph is here:


  3. Areas:

    What I did was drop the basement unfinished area and keep only the total basement area (in square feet, SF). I replaced the 1st- and 2nd-floor areas with their sum. I dropped the low-quality finished square feet, as it is very skewed.
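In code, the replacements above look roughly like this (a sketch, not my actual notebook code; the column names come from the competition's data description):

```python
import pandas as pd

def engineer_sums(df: pd.DataFrame) -> pd.DataFrame:
    """Replace groups of similar columns with their sums, as described above."""
    df = df.copy()

    # 1. Porch areas: the four porch columns plus the wooden deck area.
    porch_cols = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]
    df["TotalPorchSF"] = df[porch_cols].sum(axis=1) + df["WoodDeckSF"]
    df = df.drop(columns=porch_cols + ["WoodDeckSF"])

    # 2. Bathrooms: a plain sum; weighting half baths by 0.5 is an alternative.
    df["TotalBath"] = df["FullBath"] + df["HalfBath"]
    df = df.drop(columns=["FullBath", "HalfBath"])

    # 3. Areas: keep TotalBsmtSF as-is, drop the unfinished basement area,
    #    sum the two floors, and drop the heavily skewed low-quality area.
    df["TotalFlrSF"] = df["1stFlrSF"] + df["2ndFlrSF"]
    df = df.drop(columns=["1stFlrSF", "2ndFlrSF", "BsmtUnfSF", "LowQualFinSF"])
    return df
```

Applied as e.g. `train = engineer_sums(train)` before fitting.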

Let me know if there are better ways to handle these features, or similar situations. Much appreciated.


There are plenty of notebooks on Kaggle that scored well. Some of them require techniques that were not covered in the course. I was just trying to focus on what I learned in the course first, and improve from there.

The main reason was that before I did any feature engineering, XGBoost overfitted like crazy. I then read somewhere that redundant features should be dropped. I also saw people apply a Box-Cox transform to skewed data, although I have not read into that carefully yet.
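For the record, the Box-Cox idea looks roughly like this (a sketch using scipy's `boxcox1p`, which handles zero-valued areas; the lambda value is just an illustrative choice, not something I tuned):

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

# A skewed, non-negative feature (e.g. a porch area with a long right tail).
x = np.array([0.0, 12.0, 35.0, 40.0, 48.0, 60.0, 320.0, 700.0])

# Box-Cox on (1 + x); lam=0 reduces this to log1p.
lam = 0.15
x_bc = boxcox1p(x, lam)

# The transformed distribution should be much less skewed in magnitude.
print(skew(x), skew(x_bc))
```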


Hello chi-yu,

As for whether it was too aggressive, I believe the change in performance it causes might tell?

The downside of the sums is that they make samples less distinguishable from one another, which, I believe, is perceived as an upside if your objective is to suppress overfitting by all means? (Though the sum might not be very interpretable, e.g. why is a half bath worth 0.5?)

For another approach to “group” features other than taking sums, check this out. It might help combat overfitting, since it further limits which features are choosable.


So if I understand you correctly, instead of dropping the similar features and replacing them with a single sum, an alternative is to group them with feature interaction constraints.

In this case, if a tree has a node that splits on any grouped feature, then that tree only splits on features within the group. If we have enough trees in the ensemble, this may not affect the result seriously. This is probably better than letting each similar feature interact with other features elsewhere?

Thanks for the pro tips,

An update: I tried adding the dropped features back, together with the engineered features. The cross-validation performance score drops a bit. Maybe I will try feature interaction constraints as well.

Further update: I tried grouping the porch areas together. The cross-validation score worsened significantly. Maybe I am misinterpreting what @rmwkwok is trying to convey.

Hey chi-yu,

For some reason I am interested in that too, and I am looking into it. There will be times when I need to wait for my laptop to run the models, so if you update your notebook with the interaction constraints, please let me know and I will have a look.


PS: Besides the Porch features, there are many Garage features and many Basement features. Maybe you can try 2 more groups?

Hi Raymond (@rmwkwok), I have prepared a separate notebook for the purpose of our discussions. If you want to look at the diagrams of the porch areas and half baths, please check out the images I posted earlier.

Near the end I provide four models. The first two are my original ones. In the last two I added the porch areas and full/half baths back without any constraint. The CV score already dropped. It dropped further when I grouped these two sets of features. Let me know if I am doing what you hoped to see.


Hello chi-yu @u5470152,

Thank you. I think you are implementing the interaction constraints correctly.


Hi chi-yu @u5470152

  • I have a total of 11 constraints (i.e. 11 lists of grouped features).
  • x-axis: the fraction of constraints used. 0 means 0% (no constraints), 1 means 100% (all constraints).
  • y-axis: the mean of the losses (sample size = ‘k’ in k-fold CV = 4).
  • Each data point represents a different set of hyperparameters.
  • (Observation) It seems more constraints helped.


PS: my 11 constraints:

    ['Neighborhood', 'Condition1', ],
    ['YearBuilt', 'YearRemodAdd', 'OverallQual', 'OverallCond'] + ['BuiltToSold', 'RemodToSold'],
    ['MSZoning', 'Street', 'Alley'] + ['LotFrontage', 'LotArea', 'LotShape', 'LotConfig'] + ['LandContour', 'LandSlope'] + ['BldgType', 'HouseStyle', ],
    ['RoofStyle', 'RoofMatl', 'Exterior1st', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond'] + ['BldgType', 'HouseStyle', ],
    ['Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2',  'BsmtUnfSF', 'TotalBsmtSF'],
    ['Heating', 'HeatingQC', 'CentralAir', 'Electrical', ] + ['Fireplaces', 'FireplaceQu'] + ['Functional'],
    ['1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', ] + ['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',],
    ['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive'],
    ['OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch'],
    ['PoolArea', 'PoolQC'] + ['MiscFeature', 'MiscVal'],
    ['YrSold', 'MoSold', 'MoSold2'] + ['SaleType', 'SaleCondition'] + ['BuiltToSold', 'RemodToSold'],
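One practical note: depending on the XGBoost version, `interaction_constraints` may expect column indices rather than names, so the named groups above can be mapped through the DataFrame's columns first (toy frame with a few of the names above):

```python
import pandas as pd

# A toy frame with a few of the column names used above; the real
# DataFrame would carry all of the competition's features.
df = pd.DataFrame(columns=["OpenPorchSF", "EnclosedPorch", "3SsnPorch",
                           "ScreenPorch", "PoolArea", "PoolQC"])

named_groups = [
    ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"],
    ["PoolArea", "PoolQC"],
]

# Map each feature name to its positional index in the DataFrame.
index_groups = [[df.columns.get_loc(c) for c in group] for group in named_groups]
print(index_groups)  # [[0, 1, 2, 3], [4, 5]]
```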

Thanks for the hard work! If I understand interaction constraints correctly, choosing 11 constraints actually means choosing 12 (the non-selected features automatically become a group). I am not sure whether you grouped all of the features; if you did, then there are 11 groups.

So I guess one takeaway here is that if we choose to group features, we had better group all similar features?

Just curious, which score did you use? The root mean squared error? If so, I think the result is not bad! I assume you have not tuned the hyperparameters much.

Hello chi-yu @u5470152

I agree about the 12th group, but I think (not verified yet) it should be a virtual group containing all of the features, rather than just the left-out ones.

If you check out the graph at the bottom, and the paragraph before it, I think we can conclude that a child node has to be in the same group as its immediate parent. Therefore, if a parent node is not in any group, its child can be any feature.

I am still wrapping my head around this. So far I think we want to group features out of which we would want to somehow engineer new features. It is natural to think of them as similar features, as you pointed out, but it is not limited to that.

Root mean squared error, but my models are trained to predict the natural log of the labels instead of the raw labels.
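Concretely, the metric is computed like this (a small sketch with made-up numbers, not the notebook's actual code):

```python
import numpy as np

# Train on log(price), evaluate RMSE in log space, and
# exponentiate predictions to get back to dollar space.
y = np.array([200000.0, 150000.0, 300000.0])
log_y = np.log(y)

# Pretend these came from a model trained on log_y.
log_pred = np.array([12.20, 11.90, 12.60])

rmse_log = np.sqrt(np.mean((log_pred - log_y) ** 2))
price_pred = np.exp(log_pred)  # back to prices for a submission
```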

They are not leaderboard scores. I expect them to be a little better there.

Not much. I am running more experiments. My focus is on how to group the features. One direction is deeper trees: I think you have used max_depth=2, but that wouldn't leverage the interaction constraints much. I am looking into things like min_child_weight to grow deeper trees that are still regularized.