Why is my neural network not working well in this case?

I am trying to apply what I have learned so far to a real dataset on Kaggle. In this dataset they give you some features and you have to predict whether a passenger survived or not. I tried logistic regression from Course 1 and it reached an accuracy of about 75%. Now I am trying a neural network, as in Course 2. Here is my code.
This is what I am doing in my code:

  • First I read the data and engineer the features (e.g., if the passenger is male it becomes 1, otherwise 0, …); this step is sketched after this list.
  • For missing information I try to guess a reasonable value (for 'nan' in Age I assume the person is 80 years old, which I think will lower their survivability, …).
  • I apply a TensorFlow model with a bunch of hidden layers using 'sigmoid' or 'relu' activations (adding more layers or more epochs didn't improve the result).
  • I don't do anything else like feature scaling, normalization, …

It turns out to have an accuracy of about 70% (75% at best), which is lower than I expected.
Is this the best result I can get without outside tricks, or can I do anything else with the neural network to get a better result?
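A rough sketch of the first two steps above (the encoding and the Age fill), assuming the standard Kaggle train.csv columns; the specific values are just the choices described in the list, not a recommendation:

import pandas as pd

# read the Kaggle Titanic training data
data = pd.read_csv('./train.csv')

# encode Sex as a number: male -> 1, otherwise 0
data['Sex'] = (data['Sex'] == 'male').astype(int)

# fill missing Age with 80 (the guess described above)
data['Age'] = data['Age'].fillna(80)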

Hi @cpp219

  • When dealing with null values: as I remember, the Age feature is skewed, so you can bring its distribution closer to normal by, for example, taking the log of all values. Before that, if a feature's distribution isn't normal, the best way to fill its null values is with the median (see the sketch after this list).
  • Also, if I remember correctly, there is a column (feature) with more than 50% null values. I decided to drop that column, as I wouldn't venture to guess what values should be filled in for it.
  • One of the best choices is to use the ReLU activation function in the hidden layers, as it is very efficient in terms of time and complex calculations, in addition to many other advantages.
  • My advice is to change the number of neurons in the hidden layers: make the earlier layers wider and decrease the width toward the last hidden layers. For example, the first hidden layer could have 512 neurons, the next 256, the next 128, and so on. (A tip to speed up the calculations is to choose the number of neurons as a power of 2, which tends to fit the processor cores and make the computation faster.)
  • The scales of the features are very different from each other, so normalizing the features is a good choice; it helps with overfitting and with the distribution of the data.
  • You can use Principal Component Analysis (PCA) to reduce the number of features. The number of features is quite large, and some of them don't correlate well with the output (by correlation I don't mean just the literal correlation computation, I mean in real life). PCA reduces the number of features while still preserving most of their information.
  • Try to check the distribution of the data and transform it toward a normal distribution.
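A minimal sketch of the preprocessing ideas above (median fill, log transform, scaling, and PCA), assuming the Kaggle train.csv columns; the 95% variance threshold is just an example value:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = pd.read_csv('./train.csv')

# fill the skewed Age feature with its median, then log-transform it
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Age'] = np.log1p(data['Age'])      # log1p handles age 0 safely

# drop the mostly-null column
data = data.drop('Cabin', axis=1)

# standardize the numeric features before feeding them to a network
numeric_cols = ['Age', 'Fare', 'SibSp', 'Parch', 'Pclass']
scaled = StandardScaler().fit_transform(data[numeric_cols])

# optional: PCA, keeping enough components to explain 95% of the variance
reduced = PCA(n_components=0.95).fit_transform(scaled)
print(reduced.shape)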

Feel free to ask any questions,
Regards,
Abdelrahman

Thank you a lot for your help. Here is what I have tried:

  • I fixed the 'nan' in Age with the median and dropped the Cabin column, which is 50% 'nan', as you said.
  • I fixed my neural network; here's my code (it's short).
    The loss decreased from 0.3 → 0.15 but the accuracy stayed the same (sometimes it's worse).

    After many tries with the neural network, I see that if I let my network get too deep it drastically decreases the accuracy (I don't know why). Instead I removed some layers and added more units to one layer, and then my accuracy increased a little bit. But if I add too many units it doesn't change anymore (luckily it doesn't get worse).

So I want to know if I can do anything else to fix it.

I just took the briefest look at your code, so maybe this is missing the mark, but it looks to me like you are heavily using the test data set as input to your training. I don't think this is correct. All model training against Kaggle data sets must be done on the train data set only. Your model shouldn't even know the test data set exists. Since labels aren't provided for the test data set, there is actually nothing you can use it for during training other than to ensure your model doesn't blow up or take too long to run (some Kaggle competitions include a time cutoff) when used as input. Kaggle uses this data to evaluate your trained model, but you can't and shouldn't try.

You might also want to consider using native pandas data wrangling to clean up the dataset before training. No need for those explicit loops. See for example pandas.DataFrame.dropna — pandas 1.5.3 documentation

Here is a quick and dirty idea for cleaning up the data to give you a basic ‘rectangle’ of training inputs and labels…

#load dataset
data = pd.read_csv('./train.csv')
data.info()
#clean up and separate labels from training inputs
data = data.drop('PassengerId',axis=1)  #not intrinsic
data = data.drop('Cabin',axis=1)  #75% null values
data = data[data['Embarked'].notna()]  #only lose 2 rows
data['Age'] = data['Age'].fillna(data['Age'].mean())  #not convinced this is a good idea...test it
data['Sex'], uniques = data['Sex'].factorize()
data['Embarked'], uniques = data['Embarked'].factorize()


labels_df = data['Survived']
training_inputs_df = data.drop(columns='Survived')
training_inputs_df.info()  #DataFrame with 889 rows of not null data - the model training inputs 
labels_df.info()   #DataFrame with 889 rows of not null data - the model training labels

At this point I would consider the data ready for additional exploratory data analyses and would start doing some measurements and visualization. Then time to think about scaling, categorical vs numeric and implications on cost function, etc. Hope this helps.

PS: I separate out the Survived column since it will be convenient for setting up the neural net model, however you might benefit from using further built-in Pandas DataFrame functions to do statistical examination of the data. Here is another quick and dirty, naively converting categoricals to codes without thinking through the implications, just to get the analytics running…

#this uses the dataframe containing the Survived column since that is your prediction target
data.corr(numeric_only=True)  #ignore the remaining text columns (Name, Ticket)

Sex, Pclass and Fare. Hmmm

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(training_inputs_df,
                                                    labels_df,
                                                    test_size = 0.1)

 
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
 
print(x_train.shape)
print(x_test.shape)

print(y_train.shape)
print(y_test.shape)

 
#small fully-connected binary classifier
model = Sequential([
    Dense(units = 256,activation = 'relu'),
    Dense(units = 128,activation = 'relu'),
    Dense(units = 1,activation = 'sigmoid')   #single sigmoid output: survived / not survived
])

#binary cross-entropy loss with Adam, tracking accuracy
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
)
history = model.fit(x_train,
          y_train,
          epochs = 50)

results = model.evaluate(x_test,y_test)
print("Evaluation results: " + str(results))

Again, real quick and dirty. Relatively small model, 90/10 train/test split. Not too many epochs to avoid overfitting. Approximately 85% accuracy on the test split.

NOTE: the train/test split is built from the kaggle train data file…it does NOT use the kaggle test data file. You only use that if you want to submit your predictions for evaluation. Cheers

Final thought here is that you should probably do some additional EDA and see what might be skewing a NN model. Check class imbalance (survival/non-survival is worth a look: how could you adjust sampling to make it more balanced?), column/feature independence (are Pclass, Fare, embarkation port, and Cabin correlated? Could you drop one or more without losing information?), and the granularity of the Age variable (do you need to keep specific age values, or would coarser categorical bins like child, teenager, adult be more indicative/predictive than exact age?). A few of these checks are sketched below. You are seeing why, in real-world questions, machine learning and data science are two sides of the same coin. HTH
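A rough sketch of those checks (the age bin edges and labels are just placeholder choices):

import pandas as pd

data = pd.read_csv('./train.csv')

#class balance: fraction of survivors vs. non-survivors
print(data['Survived'].value_counts(normalize=True))

#correlation between a couple of the candidate features
print(data[['Pclass', 'Fare']].corr())

#coarser age bins instead of exact ages
bins = [0, 12, 18, 35, 60, 120]
labels = ['child', 'teenager', 'young adult', 'adult', 'senior']
data['AgeGroup'] = pd.cut(data['Age'], bins=bins, labels=labels)
print(data.groupby('AgeGroup', observed=False)['Survived'].mean())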

@ai_curious The Age feature suffers from skew, so the best way to fill its null values is with the median:

data['Age'] = data['Age'].fillna(data['Age'].median())  #fill the skewed Age feature with its median

After that, take the log() of that feature to make its distribution closer to normal. But everything you said is very good.
In addition to that, @cpp219, you could add a batch normalization layer and a Dropout layer. You could also use learning rate reduction, which lowers the learning rate after some iterations to help convergence, or early stopping, which stops training if the model starts to diverge (there is a sketch of both after the code below). In this case you could add two additional layers, like in this code:

model = Sequential([
    Dense(units = 512),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    Dense(units = 256,activation = 'relu'),
    tf.keras.layers.Dropout(rate=0.2),
    Dense(units = 128,activation = 'relu'),
    Dense(units = 56,activation = 'relu'),
    Dense(units = 1,activation = 'sigmoid')
])
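A minimal sketch of the learning rate reduction and early stopping mentioned above, using Keras callbacks (the monitor, patience, and factor values are just placeholders):

import tensorflow as tf

#lower the learning rate when the monitored loss stops improving
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.5, patience=3)

#stop training early if the monitored loss stops improving, keeping the best weights
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5, restore_best_weights=True)

history = model.fit(x_train, y_train,
                    epochs=100,
                    callbacks=[reduce_lr, early_stop])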

In addition, do more EDA and analysis of your data to identify and concentrate on its most powerful features.

Cheers,
Abdelrahman

Agree. Not only is it skewed in the full population, the mean age of the survivors (28.2) is lower than the mean of the non-survivors (30.6), so imputing with a single value, whether mean or median, isn't a perfect option. It's one example of where better knowledge of the data can help lead to a better predictive model.
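A one-liner to check that difference, assuming the train.csv DataFrame from earlier:

#mean Age by survival outcome
print(data.groupby('Survived')['Age'].mean())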

Thank you for helping me, it's incredibly useful to me. By the way, I just use the test data for the submission, not for training. And can you tell me some useful ways (a quick easy way or a beautiful way) to visualize the data, from your experience?

A beautiful way is probably beyond the scope of what can be communicated here, at least by me, but I'll hunt up some links for you to review. A quick easy way is provided by Pandas itself. For example…

#impute the mean for missing Age values, then look at the resulting distribution
age_mean = data['Age'].fillna(data['Age'].mean())
ax = age_mean.plot.hist()

See pandas.DataFrame.plot — pandas 2.1.3 documentation
for more about what Pandas DataFrame supports in the way of visualization.

It can be quite useful and enlightening to spend some time measuring and visualizing population statistics on data sets like this one (as opposed to, say, pictures of cars or medical images) before diving into predictive models. Either 1) the model might not be as insightful as it seems to be, or 2) you'll end up coming back to these analyses to figure out why the model learns slowly or doesn't generalize, so why not do it on the front end.

BTW thanks for the feedback. It’s nice to know it helped.

Beauty is in the eye of the beholder (or their manager), but here are some ideas to explore …

https://seaborn.pydata.org/examples/index.html
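
For instance, a quick seaborn sketch along those lines, assuming the train.csv DataFrame from earlier:

import seaborn as sns
import matplotlib.pyplot as plt

#age distribution split by survival outcome
sns.histplot(data=data, x='Age', hue='Survived', kde=True)
plt.show()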

Cheers

Yes, you are right. I've also heard that another effective technique for filling null values in a skewed column is to divide the training data into ranges and take the mean within each range, though the best way is still to take the median. Personally I haven't tried it :slight_smile:
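
A minimal sketch of that idea, using Pclass as the grouping column purely as an illustration:

#fill missing Age with the median Age of each Pclass group
data['Age'] = data['Age'].fillna(data.groupby('Pclass')['Age'].transform('median'))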

Regards,
Abdelrahman