I hope the topic name is not too general. I completed the MLS course and wanted to start a small data science project to foster my knowledge.
For this I took a dataset from my city covering several bicycle counting stations that record the number of cyclists per hour. I downloaded the last 10 years and also got the matching weather data.
My goal is to eventually train a model that forecasts the number of bicycles from the time, the day and the weather.
While exploring the data I am struggling with the question of whether I should impute the missing data in the bicycle dataset. The share of missing data ranges from 0% to 5% per counting station, so it is not much, and I am still left with many data points to train a model.
So my specific question is:
What is the benefit of imputing the missing data? The number of bicycles is (as I understand it in this case) my target value. So should I impute the target value? That seems odd, as the target value is the ground truth and imputing is basically just a guess.
If I should impute the values, what would be the best way to do it? I was thinking of something like taking the average for that day across all the other years (or similar).
Or should I not impute it? In this case why not?
If I am only looking at the analytical part (no ML), does imputing make sense, e.g. to visualize the data better and show trends?
But then again what additional information do I get from imputing, if I e.g. just take the average of other years? It could still be completely wrong for a certain day.
Hi @M_R2! Great to hear that you are working on your own data science project to foster your MLS knowledge.
In your case, my recommendation is not to impute the target values, as the share of missing values is very small (at most 5%). As you mentioned, the target value is what you are trying to predict, and training on imputed targets could lead to inaccurate predictions.
However, if the past pattern is a good predictor of the future, then, I believe, you can impute the output values. But keep in mind that the imputed value must come from comparable conditions (for example, working days have larger bike counts than holidays, right?). So, if data is missing for a working day, fill it in from past working days, and if it is missing for a holiday, fill it in from past holidays. Makes sense, right? If the conditions do not match, imputing the missing data could lead to inaccurate predictions.
If you decide not to impute the missing data, that is still a reasonable approach: only a small share of your data is missing, and I don’t think dropping it would significantly impact your analysis.
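To make the two options above concrete, here is a minimal sketch in pandas. The column names (`timestamp`, `bikes`, `hour`, `is_workday`) and the numbers are made up for illustration and are not from the actual dataset. The workday flag is a simplification that only looks at the weekday and ignores public holidays.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly counts for one station; NaN marks hours with no reading.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-05-01 08:00",  # Monday
        "2023-05-02 08:00",  # Tuesday
        "2023-05-03 08:00",  # Wednesday, count missing
        "2023-05-06 08:00",  # Saturday
        "2023-05-07 08:00",  # Sunday, count missing
    ]),
    "bikes": [120.0, 100.0, np.nan, 30.0, np.nan],
})
df["hour"] = df["timestamp"].dt.hour
# Assumption: Monday to Friday counts as a working day (holidays are ignored).
df["is_workday"] = df["timestamp"].dt.dayofweek < 5

# Option A: drop the rows whose target is missing.
dropped = df.dropna(subset=["bikes"])

# Option B: fill each gap with the mean of rows that share the same
# day type and hour, so workdays are only filled from workdays.
group_mean = df.groupby(["is_workday", "hour"])["bikes"].transform("mean")
imputed = df.assign(bikes=df["bikes"].fillna(group_mean))

print(dropped[["timestamp", "bikes"]])
print(imputed[["timestamp", "is_workday", "bikes"]])
```

With option B the missing Wednesday gets the workday mean and the missing Sunday gets the weekend value, so the filled rows only repeat what the other rows already say.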
Thank you @saifkhanengr for your answer and opinion.
I have a follow-up question.
Let’s say I decide to impute the counted values: what additional information do I introduce here? If I impute by, let’s say, taking the average of other days (e.g. of working days, if a working day is missing), then to me this is not new information, because I just fill in data from other days that are already known to the model. The imputed value could be completely right or wrong, but it is not new information.
On the other hand, suppose I have features like temperature and rain for a certain row, and temperature is missing but I still want to use rain. Then I see the point of imputing temperature: I can still use this row in my model and therefore introduce the new rain information, even though the temperature might just be an average of other values. In the first example I fail to see that benefit.
You are right that imputing data does not introduce additional or new information but rather enhances the existing dataset by filling in missing values.
However, it is important to keep in mind that imputing temperature based solely on rain may not capture all the necessary information. For instance, winter rain typically has a lower temperature than summer rain, right? Therefore, imputing temperature based only on the presence of rain may not be sufficient and other factors may need to be taken into consideration.
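As a rough illustration of the seasonal point, the sketch below compares filling missing temperatures with the overall mean against filling them with the mean of the same calendar month. The column names (`date`, `rain_mm`, `temp_c`) and the values are hypothetical, and the month is only one possible stand-in for "other factors".

```python
import numpy as np
import pandas as pd

# Hypothetical daily weather rows: three in January, three in July.
weather = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-01-10", "2023-01-11", "2023-01-12",
        "2023-07-10", "2023-07-11", "2023-07-12",
    ]),
    "rain_mm": [5.0, 0.0, 3.0, 4.0, 0.0, 6.0],
    "temp_c": [2.0, 4.0, np.nan, 20.0, 24.0, np.nan],
})
month = weather["date"].dt.month

# Naive fill: one overall mean, which mixes winter and summer.
overall = weather["temp_c"].fillna(weather["temp_c"].mean())

# Season-aware fill: mean temperature of the same calendar month.
month_mean = weather.groupby(month)["temp_c"].transform("mean")
by_month = weather["temp_c"].fillna(month_mean)

print(pd.DataFrame({"overall": overall, "by_month": by_month}))
```

In this toy data the overall mean puts 12.5 °C on both rainy days, while the monthly fill gives 3 °C for the January gap and 22 °C for the July gap.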
Ah sorry, I did not mean that I want to impute the temperature based on rain. It was meant as an example of where I think imputing would make sense.
E.g. you have a feature vector with temperature, rain, snow, wind speed,… but for certain rows of your dataset the temperature is missing. Then I understand that imputing the temperature (with whatever strategy) makes sense in order not to drop the row: the other information (rain, snow,…) is still useful, and imputing the temperature lets you use this row in your model as well.
This is in contrast to imputing only the target value where I do not see the actual benefit as there is no new information introduced.
But your first statement about enhancing the dataset seems to make sense. Even if I do not introduce new information with this, it could still improve the dataset I guess.
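For anyone reading along, the distinction discussed in this thread could be sketched as follows: rows with a missing target are dropped, while a missing feature is imputed so the rest of the row stays usable. The column names (`temp_c`, `rain_mm`, `bikes`) and values are invented, and the plain mean is just a placeholder for whatever imputation strategy you prefer.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: row 1 lacks a feature, row 2 lacks the target.
data = pd.DataFrame({
    "temp_c": [10.0, np.nan, 15.0, 12.0],
    "rain_mm": [0.0, 2.0, 0.0, 1.0],
    "bikes": [200.0, 80.0, np.nan, 150.0],
})

# Missing target: the row has no ground truth to learn from, so drop it.
train = data.dropna(subset=["bikes"]).copy()

# Missing feature: impute it so the row's rain value and bike count
# can still contribute to training.
train["temp_c"] = train["temp_c"].fillna(train["temp_c"].mean())

X = train[["temp_c", "rain_mm"]]
y = train["bikes"]
print(X)
print(y)
```

Three of the four rows survive: the row with the missing temperature is kept with an imputed value, and only the row without a count is lost.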