Looking for your opinions and advice on my step-by-step Titanic classification with XGBoost notebook

I have finished this course and, to put what I learned into practice, I made a detailed step-by-step binary classification notebook tackling the famous Titanic problem. I would highly appreciate any opinions, advice, criticism, or comments about the work I did.
Since recently finishing this course, I have taken the time to look deeper into machine learning, and I am planning to take more deepai courses, because the experience I had with them was really great!
Here’s a link to my notebook:

Please don’t hesitate to share any ideas you have, because I am looking forward to learning from all of you, and I hope I can also help some of you learn too.
Thanks, and good luck everyone!

Hello @Wissam,

It is a very organized, elaborate, and well-explained notebook! Thank you for sharing! As I read it, I was also curious, so I did some checking on the data, and I have a different opinion on some of your points, which I am going to share here:

  1. Sect 3.6 - The box plot indicated that the 25th percentile is lower on the Survived side, so I did some checking of my own and found that male kids also had a higher survival rate. Therefore, Age can also be indicative.

  2. Sect 5.1.1 - Since XGBoost can handle missing values, I would try not filling the missing values of Age, since it has a significant portion (20%) of missing values. I would love to see whether being missing there is indicative or not. From the check below, those with missing ages had a lower survival rate.
    [image: survival rate of passengers with missing Age vs. those with Age recorded]

  3. Sect 5.2.6 and 5.2.7 - Generally speaking, I think there is no need to bin for the sake of binning in XGBoost, because XGBoost can decide how to bin on its own. On the other hand, if we bin for the sake of making interaction features, then it is different. Therefore, I would probably keep the unbinned Age and unbinned Fare in the dataset, and also keep the interaction features.

  4. Sect 5.3 - It would have been more robust if those feature manipulations on the training set were wrapped in functions, so that we could reuse those functions on the testing set.

  5. Sect 7 - Did you consider the scale_pos_weight parameter? (There is a rough sketch of points 2, 4, and 5 right after this list.)
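
To make points 2, 4, and 5 a bit more concrete, here is a rough sketch. It is not code from your notebook: the column names come from the Kaggle Titanic data, but the feature choices, the file name train.csv, and the hyperparameter values are just placeholders to illustrate the idea.

```python
import pandas as pd
from xgboost import XGBClassifier

def add_features(df):
    """Point 4: one reusable function, so train and test get identical treatment."""
    df = df.copy()
    df["Sex"] = (df["Sex"] == "male").astype(int)
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    # Point 2: leave Age (and Fare) as NaN -- XGBoost learns a default
    # direction for missing values at every split.
    return df[["Pclass", "Sex", "Age", "Fare", "FamilySize"]]

train = pd.read_csv("train.csv")
X, y = add_features(train), train["Survived"]

# Point 5: scale_pos_weight is commonly set to (# negative samples / # positive samples).
model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    scale_pos_weight=(y == 0).sum() / (y == 1).sum(),
    eval_metric="logloss",
)
model.fit(X, y)
```

Whether any of these actually improves your score is, of course, something only an experiment can tell.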

Cheers,
Raymond

PS: I moved this thread to “MLS Resources”

Hi @rmwkwok!
First of all, wow! Thank you very much for this detailed reply that helps me tremendously!
Let me reply:

  1. About what I said regarding Age in Section 3.6: point taken, you are right, thank you!
  2. I understood at some point that XGBoost can handle missing values, but I didn’t think about leaving them as they are; that’s a great idea I should test and keep in mind. I can also say that I decided to start with XGBoost because of Andrew’s recommendation in the Machine Learning Specialization, but he also talked about the power of ensembling, and I was thinking about doing that too, for example using logistic regression and random forest and ensembling them.
  3. About the bins, I see your point too and I thought about that. I played around with keeping Age and Fare without bins, and with keeping the bins without Age and Fare, and only at the end did I add the feature interactions, so I understand what you are suggesting about keeping Age and Fare and using the bins for feature interactions.
  4. About wrapping the feature manipulations in a function: I didn’t think about making a function for these steps because the whole time I was thinking about how to put them in a pipeline. I do know how to use a simple pipeline in Python, but the next question for me to answer is: is there a way to put all the data wrangling steps into a pipeline, and if so, how does it work? I did a little research (not enough, though), and most of what I found was people explaining pipelines with two or three steps, so this isn’t clear to me yet and I would love to learn more about it.
    What are your advice and suggestions about pipeline use, in this particular notebook or in general? (There is a rough sketch of what I mean after this list.)
  5. No, actually I didn’t consider scale_pos_weight, because in the process of looking for the most commonly used parameters I didn’t see many people using it. Does it make a good difference? I had the limitation of Kaggle’s notebook running time of 12 hours, so I couldn’t include everything in the search. I saw some people tune one hyperparameter at a time, which is not ideal, or at least that’s what I thought.
  6. The reason I tried to make it a step-by-step notebook and to detail everything is that, while researching other people’s work to learn from it, I found that many of them didn’t explain the reasoning for moving from one step to the next; they stack all their code without explanation, which leaves learners hanging with no answers and no explanations, and that can be frustrating sometimes.
  7. The Machine Learning Specialization was a great start for me and I am planning to go for the Deep Learning Specialization and other deepai courses soon, probably all of them, just trying not to get ahead of myself.
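
To make question 4 concrete, here is a rough sketch of the kind of pipeline I have in mind. The column lists, the imputation choices, and the names train_df / test_df are just placeholders I made up for illustration, not my notebook’s actual steps:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from xgboost import XGBClassifier

numeric_cols = ["Age", "Fare", "SibSp", "Parch"]
categorical_cols = ["Pclass", "Sex", "Embarked"]

preprocess = ColumnTransformer(
    [
        # Numeric columns go through untouched; XGBoost can take the NaNs as-is.
        ("num", "passthrough", numeric_cols),
        # Categorical columns: fill the few missing Embarked values, then one-hot encode.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0.0,  # the Titanic data is small, so a dense output keeps things simple
)

clf = Pipeline([
    ("prep", preprocess),
    ("model", XGBClassifier(eval_metric="logloss")),
])

# clf.fit(train_df[numeric_cols + categorical_cols], train_df["Survived"])
# predictions = clf.predict(test_df[numeric_cols + categorical_cols])
```

I guess custom steps such as creating a FamilySize feature would need something like sklearn’s FunctionTransformer or a custom class with fit/transform methods, and that is the part I would like to understand better.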

Thank you very, very much again for taking the time to read my work and provide such a great reply; that’s what I was looking for: advice and help from people who have better knowledge and experience in this field than me. I highly appreciate you, your time, and your thoughts.

Cheers!
Wissam

Hello @Wissam,

I could see a lot of hard work in your notebook, and the results you have so far are actually pretty nice baselines for checking whether things like “keeping missing as missing” or scale_pos_weight make any further difference. We can read the rationale behind them and make a judgement, but to tell whether a particular thing works on a particular dataset, only experiments will do. Similarly, whatever you find on this particular dataset is not guaranteed to be a universal truth across all datasets. This is why I cannot predict whether this or that will work on your dataset this time. :wink: Kaggle gives us 10 concurrent CPU kernels, so that is some resource to make the most of if we have multiple things to try out. :wink:

I look forward to any other future updates from you.

Keep learning, and cheers,
Raymond

Thank you very much and good luck to you too.