Hello everyone. I am currently doing the Titanic Survival dataset in Kaggle to practice. Several features in the training and test data set have null values (ie. Age, Cabin, Embarked). My initial approach would be to impute the null values with the mean. However, I watched several videos on youtube that dropped the “Cabin” feature because the total null value percentage is 77%, and said that imputing the values may result in overfitting since theoretically, 25 to 30% is the limit for null values. Is there another way to approach this without dropping the feature? Thank you in advance.
Welcome to our community.
Typically, we impute the null values - You have directly used the mean to fill in the null values. Another technique would be to look at some of the other features and use that information for imputing the null values.
One such technique would be to group some of the other features, finding the mean of the specific groups, and then checking which group each of the null values belongs to and using the respective mean of that group to fill in the missing null value.
In certain cases, it might be even possible to predict the missing values of a Feature, as opposed to simply imputing (replacing with an estimate).
However, it must be noted that 77% is quite a high number and filling such a high percentage of data with guess values could impact the overall accuracy of the model as well.
I think that if you have so many null values it is a big problem as they also suggest. First before dropping you need to know how important is this feature to the label. If it is very important you can try and find the missing values from some other source if possible, if not at least find the governing relationship that can produce this feature to a good approximation, if that is possible of course.
Thank you very much for the response and advice.
While browsing earlier, I saw a method wherein they converted the whole column to binary numbers. They converted all observations that contained a value to 1 and all Null observations to 0.
Another approach is they grouped the column based on another feature (e.g., Passenger Class). And then, they populated the computed mean value of each group to the corresponding null values.
Would these be reasonable methods?
Thank you very much for the response. I did find several approaches that might be applicable to these types of datasets. You can refer to my reply to shanup for more details.
In this particular case, i feel the 2nd method would be more suitable than the 1st.
Coming to the 1st method, there seems to be no rationale in converting all the nulls to zeros…unless there was some evidence that the zeros were wrongly represented in the dataset as null.
We can replace the nulls with whatever values we want. But our aim should be to come up with a strategy to replace the nulls with plausible values. If we are not confident about our strategy, and if the % of null values is very high (as in this case), maybe the model might be better off without this feature.
If there is some evidence that this feature is of high importance then it might be worth the effort to try and replace the nulls…but 77% missing values is still a little too many to replace…and unless we can find suitably accurate replacement values it would not be of much use either.
Correct replacement of values could help improve the accuracy of the model. Incorrect replacement of values, on the other hand, could even bring down the accuracy of the model.