Issues using a simple linear regression on the Kaggle Titanic competition

Hello there

I’m new to the ML world, and I’m diligently following the AI specialization course after completing “AI For Everyone”.

At the end of week 1, we learned about linear regression and gradient descent, so I thought I’d give one of the most famous Kaggle competitions a try: Titanic.

I was hoping to submit my results by predicting the survivors with only one feature: the ticket class. It seemed like a poor idea, but I wanted to give it a try :slight_smile:

I imported the CSV thanks to this course, then created 2 simple functions: one called creer_gradient_derive(x,y,w,b) and one called descente_gradiente(x_train, y_train, alpha, num_iter), where w and b are set to 0 inside (the names of these 2 functions are in French, sorry).

When I tried to run the code, I got this error message:

/var/folders/17/lx_03r5n7sz1dnwklfw0vbxm0000gp/T/ipykernel_42390/230204310.py:10: RuntimeWarning: overflow encountered in scalar add
  dj_dw = dj_dw + dj_dw_i
/var/folders/17/lx_03r5n7sz1dnwklfw0vbxm0000gp/T/ipykernel_42390/230204310.py:11: RuntimeWarning: overflow encountered in scalar add
  dj_db += dj_db_i
/var/folders/17/lx_03r5n7sz1dnwklfw0vbxm0000gp/T/ipykernel_42390/230204310.py:7: RuntimeWarning: invalid value encountered in scalar multiply
  f_wb = w * x[i] + b
(np.float64(nan), np.float64(nan))

I’m not a developer, but I learned some Python in the past, and I’m enjoying diving into the code and understanding ML more than I thought :smiley:
Still, I’m facing an issue that I can’t figure out, even after searching the internet, including Stack Overflow.

Can anyone help me sort it out, please?

  1. If you’re building a neural network, it’s best not to initialize all weights and biases to 0, so that symmetry is broken
  2. It’s unclear if your implementation of gradient descent is correct. Even if it’s correct, it’s possible your gradients, updated weights and/or updated biases fall outside the float64 range of roughly (-1.7 \times 10^{308}, 1.7 \times 10^{308}) because of ‘exploding gradients’ - if your initial weights/biases are extremely far from the optimal weights/biases, the gradient becomes extremely large - this leads to extremely large weight/bias updates that land even further away, which in turn leads to larger gradients, larger weight updates that are even further away, …

Please post a screenshot of plotting the loss vs epochs.
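
For reference, here is a minimal sketch of how the cost could be recorded at every iteration so that a loss-vs-epochs curve can be plotted. It is not the original poster's code: the function names `cost` and `run_gradient_descent`, the toy arrays, and the two learning rates are all assumptions chosen only to show what a converging and a diverging run look like.

```python
import numpy as np

def cost(x, y, w, b):
    """Mean squared error cost J(w, b) = 1/(2m) * sum((w*x + b - y)^2)."""
    return np.mean((w * x + b - y) ** 2) / 2

def run_gradient_descent(x, y, alpha, num_iters):
    """Run gradient descent from w = b = 0 and return the cost after each step."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(num_iters):
        err = w * x + b - y              # prediction error for every example
        w -= alpha * np.mean(err * x)    # dJ/dw
        b -= alpha * np.mean(err)        # dJ/db
        history.append(cost(x, y, w, b))
    return history

# Hypothetical toy data (y = 2x), not the Titanic data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

history_small = run_gradient_descent(x, y, alpha=0.01, num_iters=20)  # cost shrinks
history_large = run_gradient_descent(x, y, alpha=1.0, num_iters=20)   # cost blows up

print("alpha=0.01:", history_small[0], "->", history_small[-1])
print("alpha=1.0: ", history_large[0], "->", history_large[-1])
```

Passing either list to `matplotlib.pyplot.plot` gives the requested curve. A cost that grows from one iteration to the next is the usual sign that the updates are overshooting, or that the gradient itself is computed incorrectly.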


If “Who survived?” means predicting True or False for each passenger, then it’s not a linear regression problem. It’s classification.

Classification uses logistic regression, which calls for a different cost function and gradients.
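
To make the difference concrete, here is a rough sketch of single-feature logistic regression trained with gradient descent. The model output goes through a sigmoid, and the gradient is taken with respect to the log-loss (cross-entropy) cost rather than the squared error. The function names, the learning rate, and the toy arrays `x_toy` and `y_toy` are assumptions for illustration only; they are not the real Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient(x, y, w, b):
    """Gradient of the log-loss cost for a one-feature logistic model."""
    err = sigmoid(w * x + b) - y        # predicted probability minus label
    return np.mean(err * x), np.mean(err)

def fit_logistic(x, y, alpha, num_iters):
    """Plain gradient descent, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        dj_dw, dj_db = logistic_gradient(x, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

# Hypothetical stand-in for (Pclass, Survived) pairs
x_toy = np.array([1, 1, 2, 2, 3, 3, 3, 3], dtype=float)
y_toy = np.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=float)

w, b = fit_logistic(x_toy, y_toy, alpha=0.1, num_iters=5000)
probabilities = sigmoid(w * x_toy + b)            # estimated P(survived)
predictions = (probabilities >= 0.5).astype(int)  # threshold into 0 / 1
print(w, b, predictions)
```

The update loop has the same shape as in linear regression. What changes is the prediction (a probability instead of an unbounded number) and the cost being minimized.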

Re: initialization:
Linear or logistic regression works perfectly fine if you initialize the weights and biases to zero, because these simple regressions aren’t neural networks, and they do not have a hidden layer.


I’m not an expert on this dataset, but I believe using this particular column is not going to be enlightening regardless of other issues with the model. This is because ticket is just the ticket number for the passenger. They are essentially sequential big integers: 363272, 363273, 363274…

These are not informative or predictive in any way. If you want to build a simple, 1 variable model, I suggest trying age, sex, or pclass. [Maybe pclass is what you are using already, and not ticket?]

There are some known (intentional) data quality issues/challenges in that data set - missing or null values, class imbalance, categorical values, etc. If you haven’t already done so, I recommend spending some time with an Exploratory Data Analysis (EDA). It is a best practice before building any kind of machine learning model. There are many examples out on the interweb, here are just two. No recommendations or endorsements…

Notice these two both ignore ticket.
They also each use available libraries of ML algorithms for the modelling. It’s a good learning exercise to write your own, but probably also really useful to start to understand what toolkits are out there to plug and play.
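
As a small starting point for the EDA mentioned above, a first pass with pandas might look like the sketch below. The column names follow the Kaggle Titanic file, but the rows in `df_toy` are made up for illustration and are not taken from the real data set. With the real file, the same three calls would be applied to the DataFrame returned by `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, only to make the example self-contained
df_toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male",
                 "male", "female", "male", "male"],
    "Age":      [38.0, 54.0, np.nan, 35.0, 22.0, 26.0, np.nan, 2.0],
    "Survived": [1, 0, 1, 0, 0, 1, 0, 0],
})

# How many values are missing in each column?
missing_per_column = df_toy.isna().sum()

# Are the two classes (survived / did not survive) balanced?
class_balance = df_toy["Survived"].value_counts(normalize=True)

# Does the survival rate differ between ticket classes?
survival_by_pclass = df_toy.groupby("Pclass")["Survived"].mean()

print(missing_per_column)
print(class_balance)
print(survival_by_pclass)
```

Checks like these show quickly which columns need cleaning and whether a single feature carries any signal at all.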

Bon voyage


First, thank you all for your help and answers!!

I’m not sure how to do it yet, but here is the code

import pandas as pd
df=pd.read_csv('2024-11-23-Pclass_survived - train - class-survived.csv') #df = dataframe

[details="display of csv"]
![image|690x396](upload://gmllRK8M77bVCSxYICFmZafkXY9.png)
[/details]
x_train = df.iloc[:,1].values # take the whole column at index 1 of the table
y_train = df.iloc[:,-1].values # take the whole last column of the table
m_train = len(x_train) # check that the number of rows can be counted with len(x) (or len(y), in fact)

def creer_gradient_derive(x,y,w,b):
    dj_dw = 0
    dj_db = 0
    m = len(x)
    
    for i in range(m) :
        f_wb = w * x[i] + b
        dj_dw_i = dj_dw - (f_wb - y[i]) * x[i]
        dj_db_i = dj_db - (f_wb - y[i])
        dj_dw = dj_dw + dj_dw_i
        dj_db += dj_db_i
    
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_dw, dj_db 

def descente_gradiente (x,y, alpha_input, num_iteration):
    w = 0 
    b = 0 

    for i in range (num_iteration) :
        dj_dw, dj_db = creer_gradient_derive(x,y,w,b)
        w = w - alpha_input * dj_dw
        b = b - alpha_input * dj_db
    return w,b
    
descente_gradiente(x_train, y_train, 1.0e-2, 10)
error message
/var/folders/17/lx_03r5n7sz1dnwklfw0vbxm0000gp/T/ipykernel_58917/230204310.py:10: RuntimeWarning: overflow encountered in scalar add
  dj_dw = dj_dw + dj_dw_i
/var/folders/17/lx_03r5n7sz1dnwklfw0vbxm0000gp/T/ipykernel_58917/230204310.py:11: RuntimeWarning: overflow encountered in scalar add
  dj_db += dj_db_i
/var/folders/17/lx_03r5n7sz1dnwklfw0vbxm0000gp/T/ipykernel_58917/230204310.py:7: RuntimeWarning: invalid value encountered in scalar multiply
  f_wb = w * x[i] + b
(np.float64(nan), np.float64(nan))
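
A likely cause of the overflow in the code above, offered as a sketch rather than a verified diagnosis: inside the loop, dj_dw_i is computed from the running total dj_dw, and that total is then added to it again, so the accumulated value roughly doubles with every training example. Over several hundred rows this grows exponentially until float64 overflows, which would also explain why lowering the number of iterations does not help. The per-example term also carries a minus sign, which would turn the update into gradient ascent. The version below computes each example's contribution on its own. The English function names and the toy arrays `x_toy` and `y_toy` are hypothetical stand-ins, not the real Titanic data.

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Gradient of the squared-error cost for the model f(x) = w*x + b."""
    m = len(x)
    dj_dw = 0.0
    dj_db = 0.0
    for i in range(m):
        err = (w * x[i] + b) - y[i]   # error for this example only
        dj_dw += err * x[i]           # accumulate; no running total on the right
        dj_db += err
    return dj_dw / m, dj_db / m

def gradient_descent(x, y, alpha, num_iters):
    """Plain gradient descent, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

# Hypothetical stand-in for (Pclass, Survived) pairs
x_toy = np.array([1, 1, 2, 2, 3, 3, 3, 3], dtype=float)
y_toy = np.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=float)

w, b = gradient_descent(x_toy, y_toy, 1.0e-2, 10_000)
print(w, b)
```

On data like this, the result should stay finite and approach the ordinary least-squares line. This only addresses the numerical blow-up. As pointed out earlier in the thread, predicting survival is a classification task, so a straight-line fit is not the best-suited method for it.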

You are right! But for the sake of the exercise, I’m treating the Pclass feature as a “what if”, to try implementing what I’ve learned as I learn it and see how the predictions evolve X)
I tried reducing the number of iterations from 10000 to 10, but I have the same issue.

I’m using Pclass, sorry for the confusion

Yes! I saw that while looking into the data! So I thought to myself that I should use one feature which is always present :slight_smile:


Not sure why, but the screenshots aren’t visible. Please reply to the DM, I will try to solve the issue separately and post the solution here - otherwise this may take multiple back and forth comments.


That does not address the issue: you’re using an incorrect method.
