I would like some help shedding light on how to incorporate a feature that accounts for rising prices in the housing market. For example, during Covid the housing market here in Sweden seems to have been boosted by central bank stimulus, so I guess that none of the classical features (e.g. square meter size, listing price, number of bedrooms) can account for that. Does anyone have suggestions on how to handle this? I guess one could retrain a multiple linear regression model on a regular basis with new incoming data on sold houses/apartments, but that doesn’t seem like a good solution. Just to be clear, I was thinking of a model like
house price = B0 + square meter size * B1 + number of bedrooms * B2 + …
I think you would agree that ultimately we want our model to take into account all relevant factors in a correct manner.
In your example model, you are suggesting two relevant factors, but as Tom pointed out, you might want to add something extra to take the central bank’s activity into account, and you want to add it in a correct manner. So the first question is: what kind of features do you have that describe the central bank’s activity? They could be direct or indirect features. And the second question is: how does the price react to those extra features?
To answer the second question, I think you might want to do some data analysis first, before adding those features to the end of your example model. Do you have those extra features, and do you have some analysis that you can share with us?
Some thoughts on this case based on @rmwkwok’s comment:
If the model was built pre-covid, there was no way to know that a pandemic was coming, so the model was built with the historical data available, and it learned to predict prices based on that history.
Then comes Covid. House pricing changes abruptly due to changes in external factors. I would say that this is a case of skew in the data caused by these external factors. At this point the model can certainly fail in its predictions. As a first attempt to make the model work again, I would retrain it with the new data plus a couple of additional features: interest rates set by the central bank, and mortgage rates. This might help the model learn to predict new pricing based on changes in these two additional features. I might even add the inflation rate.
We can expect that in a post-Covid environment, the central bank will eventually start lowering rates again, and mortgage rates as well as inflation will also start dropping as a consequence. So with these 3 additional features we may enhance the predictive capabilities of the model.
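A minimal sketch of how those three features could be attached to each sale, assuming the macro series are available per month. All names here (`macro_by_month`, `policy_rate`, `mortgage_rate`, `inflation`, `add_macro_features`) and all numbers are made up for illustration; they are not taken from any real dataset or API.

```python
from datetime import date

# Hypothetical monthly macro series keyed by (year, month); values are invented.
macro_by_month = {
    (2020, 1): {"policy_rate": 0.00, "mortgage_rate": 1.50, "inflation": 1.2},
    (2020, 2): {"policy_rate": 0.00, "mortgage_rate": 1.45, "inflation": 1.0},
}

# Hypothetical sold listings; field names are illustrative only.
sales = [
    {"sold_date": date(2020, 1, 15), "sqm": 55.0, "rooms": 2, "sold_price": 3_100_000},
    {"sold_date": date(2020, 2, 3), "sqm": 82.0, "rooms": 3, "sold_price": 4_450_000},
    {"sold_date": date(2020, 3, 9), "sqm": 40.0, "rooms": 1, "sold_price": 2_300_000},
]


def add_macro_features(sale, macro):
    """Return a copy of the sale extended with the macro values of its sold month.

    Returns None when no macro data exists for that month, so the caller
    can decide whether to drop or impute such rows.
    """
    key = (sale["sold_date"].year, sale["sold_date"].month)
    if key not in macro:
        return None
    return {**sale, **macro[key]}


# Keep only the sales for which macro data is available.
rows = [
    row
    for row in (add_macro_features(s, macro_by_month) for s in sales)
    if row is not None
]
```

The resulting `rows` can then be fed to whatever regression is used. Whether to join on the sold month itself or on an earlier month (a lag) is a modelling choice that the data analysis would have to settle.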
Regarding COVID cases, that is certainly a data point that is available, but I guess it would at first correlate negatively with housing prices, before the central banks actively intervene. This is just a guess, of course. One more problem, given that I would like to use this model in the future, is that Covid cases probably won’t be measured as regularly as they have been in the past. So what is left is to find some measure of central bank activity that is continually being measured, I guess.
The model example was just an example. The data points I have after scraping Booli (which is a Swedish housing site) and their exposed GraphQL API are the following:
I certainly have some feature engineering to do before starting to build models, but from the screenshot above you can clearly see that I don’t have any feature that would in any way capture the price increase that we saw during Covid. Just to show you how the square meter price has changed during the period 2013 - YE 2022:
I like the idea of adding features for the central bank’s lending rate, some average mortgage rates from the biggest Swedish banks, and the inflation rate. I will certainly see if the central bank and the Swedish banks have some sort of API that exposes these data points on a regular basis.
If I wanted to train the model with and without covid, I would also add a flag: CovidEraPrice : 1: True, 0: False.
I would definitely need to gather price data for properties sold during Covid to train my model.
Maybe I could even think of a more generic solution and name my new feature differently. Ideas for the name could be:
In_Pandemic
In_SpecialWorldEvent
Or something like that. And any time there’s another pandemic or another special situation in the world, I would feed the data with the flag = true.
But again: I think we’ll need to collect property prices from during the pandemic, before the pandemic, and even after the pandemic. In fact, this makes me think: should we make that new feature even more sophisticated? For example:
SpecialEvent:
0: No special event
1: During special event
2: Getting out of special event
Just some rambling ideas I’ve had while thinking about this case.
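If the three-level SpecialEvent flag goes into a linear model, one thing to watch is that feeding it in as a single number (0, 1, 2) forces the "getting out" effect to be exactly twice the "during" effect. A minimal sketch of a one-hot encoding that avoids this, with level 0 as the baseline; the function name `one_hot_special_event` is hypothetical:

```python
# The three levels proposed above: 0 = no special event,
# 1 = during special event, 2 = getting out of special event.
SPECIAL_EVENT_LEVELS = (0, 1, 2)


def one_hot_special_event(level):
    """Encode a SpecialEvent level as two 0/1 indicator features.

    Level 0 is the baseline and maps to [0, 0]; each other level gets
    its own indicator, so a linear model can learn a separate
    coefficient for "during" and for "getting out".
    """
    if level not in SPECIAL_EVENT_LEVELS:
        raise ValueError(f"unknown SpecialEvent level: {level!r}")
    return [1 if level == k else 0 for k in SPECIAL_EVENT_LEVELS[1:]]
```

For example, `one_hot_special_event(2)` gives `[0, 1]`, i.e. "not during, but getting out".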
My understanding is that @Christopher_Furu wants to make predictions that consider the housing market not only in normal economic environments but also in abnormal environments like Covid.
We know there are hidden forces that drive the change of house prices. The thing is, do we have features which relate the force and the price in the following way:
[ Hidden force ] → [our feature] → [housing price]
Do you have access to expertise/insight that can help you single out (see PS2) some apartments (by their location and size, for example, or by other conditions), so that you can plot the time series of those apartments’ dynamic features (see PS1) alongside the prices?
Raymond
PS1: for example, flat area is not a dynamic feature.
PS2: the purpose is to segment apartments by their response to the hidden forces. For example, low-price apartments may be affected less or more compared with high-price ones because their sources of demand are different.
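A minimal sketch of the segmentation in PS2, assuming apartments are split into a "low" and a "high" segment by listing price and that the quantity tracked over time is the median sold price per square meter. The threshold, the field names and the sample numbers are all invented for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical sold listings; all values are made up.
sales = [
    {"sold_date": date(2020, 1, 10), "list_price": 2_500_000, "sold_price": 2_600_000, "sqm": 50},
    {"sold_date": date(2020, 1, 20), "list_price": 2_800_000, "sold_price": 3_000_000, "sqm": 60},
    {"sold_date": date(2020, 1, 5), "list_price": 5_000_000, "sold_price": 5_400_000, "sqm": 90},
    {"sold_date": date(2020, 2, 11), "list_price": 2_700_000, "sold_price": 2_750_000, "sqm": 50},
]


def segment_of(sale, threshold=3_000_000):
    """Assign a sale to a price segment (the threshold is an arbitrary assumption)."""
    return "low" if sale["list_price"] < threshold else "high"


# Collect price per square meter for every (segment, year, month) bucket.
buckets = defaultdict(list)
for sale in sales:
    key = (segment_of(sale), sale["sold_date"].year, sale["sold_date"].month)
    buckets[key].append(sale["sold_price"] / sale["sqm"])

# One median per bucket gives a monthly time series for each segment.
monthly_median = {key: median(values) for key, values in buckets.items()}
```

Plotting `monthly_median` per segment next to a dynamic feature such as the policy rate would show whether the segments react at different speeds or scales.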
Just to be clear this is just an exploratory project that I’m creating using Azure Functions and an Azure Data Lake. The purpose I have is to try to make predictions on future objects without having to retrain my model quite often in this volatile housing market.
My initial thought was that I can certainly train a model on historical data and get some kind of performance, but that model would degrade quickly given the market volatility and the static features I have at hand. I was thinking that, among other things, the sold prices surely have some seasonality and some sort of trend. The seasonality could perhaps be modeled by creating a month feature or a quarter feature from “soldDate”. The trend is the basis of this discussion, if I have my thoughts in the right place, or maybe I have confused myself big time hehe…
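A minimal sketch of the month and quarter features, assuming “soldDate” arrives as an ISO string such as "2021-08-17" (the actual format in the scraped data may differ). The function and output key names are hypothetical.

```python
from datetime import date


def seasonal_features(sold_date_iso):
    """Derive month (1-12) and quarter (1-4) from an ISO 'YYYY-MM-DD' date string."""
    sold = date.fromisoformat(sold_date_iso)
    return {
        "sold_month": sold.month,
        # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
        "sold_quarter": (sold.month - 1) // 3 + 1,
    }
```

In a linear model these would normally be treated as categorical (for example one indicator per month) rather than as plain numbers, since December and January are adjacent in the calendar but far apart numerically.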
Anyway, I managed to retrieve monthly inflation data from the Swedish Riksbank SOAP API and plotted Average Sold Sqm Price in SEK vs Monthly Inflation. Here is the plot:
For those who are interested, here is a correlation plot of the features at hand, without any feature engineering. Just to be clear, CPIF is the inflation feature.
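One caveat when reading such correlations: two series that both trend upward over time can correlate strongly even if they are unrelated. A common check is to correlate month-over-month changes instead of levels. A minimal, dependency-free sketch of that check; the helper names `pearson` and `diffs` are hypothetical, and the inputs are assumed to be two equally long, time-aligned lists such as monthly average sqm price and monthly CPIF:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def diffs(values):
    """Period-over-period changes: values[i + 1] - values[i]."""
    return [b - a for a, b in zip(values, values[1:])]
```

Comparing `pearson(prices, cpif)` with `pearson(diffs(prices), diffs(cpif))` gives a rough idea of how much of the level correlation is just shared trend.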
Anyway, thanks for your help. I think I have to search for some dynamic feature that can capture the price trend, and I also need to do some reading hehe…
I just want to make one suggestion that I hope you will consider in your future analytics. We want to look at more data at once for good statistics, but we sometimes want to look at less, for example when different sub-groups of houses react to the market at different speeds or on different scales. We don’t want to mix them up because it can complicate things.
Hi @Christopher_Furu
In addition to what all the mentors said, I think there are two different ways to deal with it:
1. You can conclude that when Covid (or, more correctly, an epidemic) spreads, it leads to an increase in prices for many reasons, including an increase in bank interest rates (interest on loans) or an imbalance between supply and demand, among many others. Working out these factors is what is called feature engineering. So if you are interested in knowing which elements affect prices, you should sit down with, for example, banks, sales and leasing offices, etc.
2. If you think that house prices increased so much because of Covid, you can consider that prices were high only during the Covid period, and that this period did not last long, meaning it is not a common occurrence. The data from the Covid year can therefore be considered unreliable and inaccurate, and it suffers from bias.