XGBoost for house price competitions

Hi all,

I applied what I learned about tree ensembles to the house-price competition on Kaggle. The dataset provides 80 features for predicting the sale price. After some work I reached roughly the 26th-27th percentile. There are a few things I wonder might improve the score even further.

I wonder if I was too aggressive in dropping features. In the write-up, I first try to detect features that are very similar and drop the ones whose price variances are higher. I did this because tree ensembles aim to minimize variance when splitting.
Then I also dropped features that are very skewed, in the sense that the data cluster around a narrow range and the remaining points look like outliers. Aside from these, I guess my biggest questions are how to handle the following features.

  1. Porch areas: For your convenience, I cropped one of the images produced in the notebook.

    Let us ignore the NA graph on the right for now. The total number of data points is 1460. As we can see, most houses have little to no porch area. In the end I dropped all four of these columns and replaced them with their sum, plus the wooden deck area, which gives a much better distribution. I wonder if this was too aggressive, in that it made the model too conservative.

  2. Bathrooms: I also replaced half baths and full baths with total baths (their sum). Again, the graph is here:


  3. Areas:

    What I did was drop the basement unfinished area and keep only the total basement area (in square feet, SF). I replaced the 1st- and 2nd-floor areas with their sum. I dropped the low-quality finished square feet, as it is very skewed.
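In code, the replacements above look roughly like this (a sketch, not my actual notebook code; the column names come from the competition's data description):

```python
import pandas as pd

def engineer_sums(df: pd.DataFrame) -> pd.DataFrame:
    """Replace groups of similar columns with their sums, as described above."""
    df = df.copy()

    # 1. Porch areas: the four porch columns plus the wooden deck area.
    porch_cols = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]
    df["TotalPorchSF"] = df[porch_cols].sum(axis=1) + df["WoodDeckSF"]
    df = df.drop(columns=porch_cols + ["WoodDeckSF"])

    # 2. Bathrooms: a plain sum; weighting half baths by 0.5 is an alternative.
    df["TotalBath"] = df["FullBath"] + df["HalfBath"]
    df = df.drop(columns=["FullBath", "HalfBath"])

    # 3. Areas: keep TotalBsmtSF as-is, drop the unfinished basement area,
    #    sum the two floors, and drop the heavily skewed low-quality area.
    df["TotalFlrSF"] = df["1stFlrSF"] + df["2ndFlrSF"]
    df = df.drop(columns=["1stFlrSF", "2ndFlrSF", "BsmtUnfSF", "LowQualFinSF"])
    return df
```

Applied as e.g. `train = engineer_sums(train)` before fitting.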

Let me know if there are better ways to handle these features, or similar situations. Much appreciated.


There are plenty of notebooks on Kaggle that scored well. Some of them require techniques that were not covered in the course. I was just trying to focus on what I learned in the course first, and improve from there.

The main reason was that before I did any feature engineering, XGBoost overfitted like crazy. I then read somewhere that redundant features should be dropped. I also saw people apply a Box-Cox transform to skewed data, although I have not read into that carefully yet.
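For the record, the Box-Cox idea looks roughly like this (a sketch using scipy's `boxcox1p`, which handles zero-valued areas; the lambda value is just an illustrative choice, not something I tuned):

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

# A skewed, non-negative feature (e.g. a porch area with a long right tail).
x = np.array([0.0, 12.0, 35.0, 40.0, 48.0, 60.0, 320.0, 700.0])

# Box-Cox on (1 + x); lam=0 reduces this to log1p.
lam = 0.15
x_bc = boxcox1p(x, lam)

# The transformed distribution should be much less skewed in magnitude.
print(skew(x), skew(x_bc))
```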


Hello chi-yu,

As for whether it was too aggressive, I believe the change in performance it causes might tell?

The downside of the sums is that they make samples less distinguishable from one another, which, I believe, is perceived as an upside if your objective is to suppress overfitting by all means? (Though the sum might not be very interpretable, e.g. why is a half bath worth 0.5?)

For another approach to “group” features other than taking sums, check this out. It might help combat overfitting, since it further limits which features are choosable.


So if I understand you correctly, instead of dropping the similar features and replacing them with a single sum, an alternative is to group them with feature interaction constraints.

In this case, if a tree has a node that splits on any grouped feature, then that tree only splits on features within the group. If we have enough trees in the ensemble, this may not affect the result seriously. This is probably better than letting each similar feature interact with other features elsewhere?

Thanks for the pro tips,

An update: I tried adding the dropped features back, together with the engineered features. The cross-validation performance score drops a bit. Maybe I will try feature interaction constraints as well.

Further update: I tried grouping the porch areas together. The cross-validation score worsened significantly. Maybe I am misinterpreting what @rmwkwok is trying to convey.

Hey chi-yu,

For some reason I am interested in that too, and I am looking into it. There will be times when I need to wait for my laptop to run the models, so if you update your notebook with the interaction constraints, please let me know and I will have a look.


PS: Besides the Porch features, there are many Garage features and many Basement features. Maybe you can try 2 more groups?

Hi Raymond (@rmwkwok), I have prepared a separate notebook for the purpose of our discussions. If you want to look at the diagrams of the porch areas and half baths, please check out the images I posted earlier.

Near the end I provide four models. The first two are my original ones. In the last two I added the porch areas and full/half baths back without any constraint. The CV score already dropped. It dropped further when I grouped these two sets of features. Let me know if I am doing what you hoped to see.


Hello chi-yu @u5470152,

Thank you. I think you are implementing the interaction constraints correctly.


Hi chi-yu @u5470152

  • I have a total of 11 constraints (i.e. 11 lists of grouped features).
  • x-axis: the fraction of constraints used. 0 means 0% (no constraints), 1 means 100% (all constraints).
  • y-axis: the mean of the losses (sample size = ‘k’ in k-fold CV = 4).
  • Each data point represents a different set of hyperparameters.
  • (Observation) It seems more constraints helped.


PS: my 11 constraints:

    ['Neighborhood', 'Condition1', ],
    ['YearBuilt', 'YearRemodAdd', 'OverallQual', 'OverallCond'] + ['BuiltToSold', 'RemodToSold'],
    ['MSZoning', 'Street', 'Alley'] + ['LotFrontage', 'LotArea', 'LotShape', 'LotConfig'] + ['LandContour', 'LandSlope'] + ['BldgType', 'HouseStyle', ],
    ['RoofStyle', 'RoofMatl', 'Exterior1st', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond'] + ['BldgType', 'HouseStyle', ],
    ['Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2',  'BsmtUnfSF', 'TotalBsmtSF'],
    ['Heating', 'HeatingQC', 'CentralAir', 'Electrical', ] + ['Fireplaces', 'FireplaceQu'] + ['Functional'],
    ['1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', ] + ['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',],
    ['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive'],
    ['OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch'],
    ['PoolArea', 'PoolQC'] + ['MiscFeature', 'MiscVal'],
    ['YrSold', 'MoSold', 'MoSold2'] + ['SaleType', 'SaleCondition'] + ['BuiltToSold', 'RemodToSold'],
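One practical note: depending on the XGBoost version, `interaction_constraints` may expect column indices rather than names, so the named groups above can be mapped through the DataFrame's columns first (toy frame with a few of the names above):

```python
import pandas as pd

# A toy frame with a few of the column names used above; the real
# DataFrame would carry all of the competition's features.
df = pd.DataFrame(columns=["OpenPorchSF", "EnclosedPorch", "3SsnPorch",
                           "ScreenPorch", "PoolArea", "PoolQC"])

named_groups = [
    ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"],
    ["PoolArea", "PoolQC"],
]

# Map each feature name to its positional index in the DataFrame.
index_groups = [[df.columns.get_loc(c) for c in group] for group in named_groups]
print(index_groups)  # [[0, 1, 2, 3], [4, 5]]
```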

Thanks for the hard work! If I understand interaction constraints correctly, choosing 11 constraints actually means choosing 12 (the non-selected features automatically become a group). I am not sure whether you grouped all of the features; if you did, then there are 11 groups.

So I guess one takeaway here is that if we choose to group features, we had better group all similar features?

Just curious, which score did you use? The root mean squared error? If so, I think the result is not bad! I assume you have not tuned the hyperparameters much.

Hello chi-yu @u5470152

I agree about the 12th group, but I think (not verified yet) it should be a virtual group containing all of the features, rather than just the left-out ones.

If you check out the graph at the bottom, and the paragraph before it, I think we can conclude that a child node has to be in the same group as its immediate parent. Therefore, if a parent node is not in any group, its child can be any feature.

I am still wrapping my head around this. So far I think we want to group features out of which we would want to somehow engineer new features. It is natural to think of them as similar features, as you pointed out, but it is not limited to that.

Root mean squared error, but my models are trained to predict the natural log of the labels instead of the raw labels.
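Concretely, the metric is computed like this (a small sketch with made-up numbers, not the notebook's actual code):

```python
import numpy as np

# Train on log(price), evaluate RMSE in log space, and
# exponentiate predictions to get back to dollar space.
y = np.array([200000.0, 150000.0, 300000.0])
log_y = np.log(y)

# Pretend these came from a model trained on log_y.
log_pred = np.array([12.20, 11.90, 12.60])

rmse_log = np.sqrt(np.mean((log_pred - log_y) ** 2))
price_pred = np.exp(log_pred)  # back to prices for a submission
```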

They are not leaderboard scores. I expect them to be a little better there.

Not much. I am running more experiments. My focus is on how to group the features. One direction is deeper trees: I think you have used max_depth=2, but that wouldn't leverage the interaction constraints much. I am looking into things like min_child_weight to grow deeper trees that are still regularized.