I just took the briefest look at your code, so maybe this is missing the mark, but it looks to me like you are heavily using the test data set as input to your training. I don’t think this is correct. All model training against Kaggle data sets must be on the train data set only; your model shouldn’t even know the test data set exists. Since labels aren’t provided for the test data set, there is actually nothing you can use it for during training other than to ensure your model doesn’t blow up or take too long to run (some Kaggle competitions include a time cutoff) when used as input. Kaggle uses that data to evaluate your trained model, but you can’t and shouldn’t try.
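If you still want an unbiased accuracy estimate before submitting, carve a validation set out of train.csv yourself. A minimal pandas-only sketch (the toy frame stands in for the real file so the snippet runs anywhere):

```python
import pandas as pd

# stand-in for pd.read_csv('./train.csv') so this snippet runs anywhere
data = pd.DataFrame({'Survived': [0, 1] * 20, 'Fare': range(40)})

# hold out 20% of the labelled rows as a local validation set;
# drop() removes those same indices from the training portion
valid_df = data.sample(frac=0.2, random_state=42)
train_df = data.drop(valid_df.index)
```

scikit-learn’s train_test_split does the same job and can stratify on Survived so both splits keep the same survival ratio.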
You might also want to consider using native pandas data wrangling to clean up the dataset before training. There is no need for those explicit loops. See for example the pandas.DataFrame.dropna documentation.
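For instance, dropping every row with a missing value in a given column is one call, no loop required (tiny made-up frame for illustration):

```python
import pandas as pd

# tiny stand-in frame with some missing values
df = pd.DataFrame({'Age': [22.0, None, 38.0], 'Fare': [7.25, 8.05, None]})

# drop any row whose Age is missing -- no explicit loop needed
df = df.dropna(subset=['Age'])
```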
Here is a quick and dirty idea for cleaning up the data to give you a basic ‘rectangle’ of training inputs and labels…
import pandas as pd

#load dataset
data = pd.read_csv('./train.csv')
data.info()
#clean up and separate labels from training inputs
data = data.drop('PassengerId',axis=1) #not intrinsic
data = data.drop('Cabin',axis=1) #75% null values
data = data[data['Embarked'].notna()] #only lose 2 rows
data['Age'] = data['Age'].fillna(data['Age'].mean()) #not convinced this is a good idea...test it
data['Sex'], uniques = data['Sex'].factorize()
data['Embarked'], uniques = data['Embarked'].factorize()
labels_df = data['Survived']
training_inputs_df = data.drop(columns='Survived')
training_inputs_df.info() #DataFrame with 889 rows of non-null data - the model training inputs
labels_df.info() #Series with 889 non-null values - the model training labels
At this point I would consider the data ready for additional exploratory data analysis and would start doing some measurements and visualization. Then it’s time to think about scaling, categorical vs numeric features and the implications for the cost function, etc. Hope this helps.
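On the scaling point, plain pandas is enough for a first pass at z-score standardisation (stand-in columns below; with the real data this would be something like training_inputs_df[['Age', 'Fare']]):

```python
import pandas as pd

# stand-in numeric columns for illustration
df = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 35.0],
                   'Fare': [7.25, 71.28, 7.93, 53.10]})

# z-score standardisation: zero mean, unit variance per column
scaled = (df - df.mean()) / df.std()
```

This keeps Fare (which spans a huge range) from dominating Age in a distance- or gradient-based model.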
PS: I separate out the Survived column since that will be convenient when setting up the neural net model. However, you might benefit from using further built-in pandas DataFrame functions to do statistical examination of the data. Here is another quick and dirty example, naively converting categoricals to codes without thinking through the implications, just to get the analytics running…
#this uses the dataframe containing the Survived column since that is your prediction target
data.corr(numeric_only=True) #Name and Ticket are still object columns, so restrict to numeric
Sex, Pclass and Fare. Hmmm
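To dig into those correlations, groupby is handy: with a 0/1 target, the group mean is just the survival rate. A sketch with toy rows standing in for the real train.csv:

```python
import pandas as pd

# toy rows standing in for the real train.csv
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female'],
    'Pclass': [3, 1, 3, 2, 1],
    'Survived': [0, 1, 1, 0, 1],
})

# mean of the 0/1 label per group = survival rate per group
rates = df.groupby('Sex')['Survived'].mean()
```

On the real data this makes the Sex correlation very concrete, and grouping by Pclass (or both) tells a similar story.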