To whom it may concern,
I am trying to experiment with building a feedforward NN for a binary classification task on an imbalanced dataset. The training data has ~80000 points and 21 features, most of which are binary. The model I have constructed has 3 hidden layers with 32, 32, and 16 hidden units. I am observing the following things, which are hard to explain:
- Relatively low training accuracy (~85%), high dev set accuracy (~97%), high test accuracy (~92%)
- ROC AUC values are about the same for the train and test sets (~88%), while on the dev set it is ~93%.
- Training and dev set accuracy change very little regardless of how long I train; specifically, the training accuracy oscillates very slightly around the above value (±0.01) and the dev set accuracy just remains constant.
Can you please share any thoughts on what I can improve in my model? I am not sure whether there is an error, or how I would diagnose one in this case.
Thanks a lot in advance!
Endrit
Hello Endrit,
Here are my thoughts on these points:
- I am not sure accuracy is the best metric for imbalanced datasets; it may be better to look at precision, recall, and F1 (see the sketch after this list).
- It is probably because of the imbalance, and also because the train, dev, and test sets are not from the same distribution: you are using a small part of the training set for testing, which is probably not representative of the entire dataset. Shuffling the data before splitting may help, and there are more advanced techniques in TensorFlow Extended (TFX) for checking data schemas.
- On top of the above, you might need to build a more complex network or even use transfer learning on a pretrained one.
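As a rough illustration, here is a minimal sketch of how these metrics could be computed with scikit-learn; `y_true` and `y_prob` are placeholders for your labels and the model's predicted probabilities, and the 0.5 threshold is just an example:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Turn predicted probabilities into hard labels with an example 0.5 threshold.
# y_true and y_prob are placeholders for the real labels and model outputs.
y_pred = (y_prob >= 0.5).astype(int)

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_prob))  # AUC uses the probabilities directly
```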
Hello Gent,
Thank you for the reply.
- I agree that accuracy is not a good metric for imbalanced data, and I am tracking ROC AUC precisely because of that. I am not sure how to diagnose the fit if I consider precision/recall/F1 instead. Would it work the same way as with accuracy? For instance, is it underfitting if precision/recall/F1 are low on both the training and dev sets, and overfitting if they are high on the training set but low on the dev set?
- Indeed, the train set and the dev/test sets have different distributions. I tried to augment the training set with other available datasets. What techniques are you referring to? Could you please share a link/tutorial?
- Even if I increase the number of layers and nodes per layer, the training set performance does not improve much. I am currently using a batch size of 10, the Adam optimizer with the default learning rate, and training for 40 epochs (a sketch of this setup follows the list). How would you determine how complex a network needs to be in this case? I have not been able to find any pretrained network for my application.
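For reference, this is roughly the setup I am using (ReLU hidden layers, a sigmoid output, and binary cross-entropy in my case; X_train, y_train, X_dev, y_dev stand in for my actual arrays):

```python
from tensorflow import keras

# 3 hidden layers (32, 32, 16), binary output, Adam with the default
# learning rate, batch size 10, 40 epochs, as described above.
model = keras.Sequential([
    keras.layers.Input(shape=(21,)),            # 21 features
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[keras.metrics.AUC(name="roc_auc"), "accuracy"],
)
model.fit(
    X_train, y_train,
    validation_data=(X_dev, y_dev),
    batch_size=10,
    epochs=40,
    shuffle=True,
)
```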
Thank you again for your time and help!
All the best,
Endrit
You can find videos covering precision/recall/F1 in the Deep Learning Specialization, and, as far as I remember, in the Natural Language Processing Specialization and the Machine Learning course as well. These metrics give good insight for imbalanced datasets. Generally speaking, if the score is low on the test set but high on the training set, that indicates overfitting (here is a video on these metrics: Lecture 11.4 — Machine Learning System Design | Trading Off Precision And Recall — [Andrew Ng] - YouTube).
You have a lot of data points, so I do not think there should be any need to augment them. Just make sure the data is shuffled when the dev and test sets are created, so that all the sets include data with similar characteristics (the model needs to see all kinds of occurrences in training so it can learn); one way to do this is sketched below. For example, if you have a tf.data.Dataset there are some tips here: Performance tips | TensorFlow Datasets, but you might need to work out your own arrangement.
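For example, a shuffled, stratified train/dev/test split could be set up like this (X and y stand for the full feature matrix and labels; the split ratios are only illustrative):

```python
from sklearn.model_selection import train_test_split

# First carve off 30% of the data, then split that half-and-half into dev
# and test. stratify keeps the class balance similar across all three sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, shuffle=True, stratify=y, random_state=42
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, shuffle=True, stratify=y_tmp, random_state=42
)
```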
Also, I think your data may not be fully preprocessed yet, which means you need to work with it: remove features that do not contribute anything to the prediction, and perhaps create new features by combining some of the existing ones. This is a lot of work; there are some nice courses on Kaggle about how to work with data. Maybe a deep learning model is not the best fit for this case; they use a lot of random forests over there, so have a look, although NNs are of course a tool that can solve many problems. A more complex network could help, but I suspect your data is not preprocessed because you say you have 21 features (some of them could be redundant). This might be the crux of your problem.
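As one way to spot features that contribute little, you could fit a quick random forest baseline and look at its feature importances (X_train is assumed here to be a DataFrame with your 21 features):

```python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# A quick baseline; class_weight="balanced" compensates for the imbalance.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
rf.fit(X_train, y_train)

# Rank the features; consistently tiny importances suggest redundant columns.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```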
In terms of a pretrained model, you could check the TensorFlow Hub models; you might find something there for your case. But first I think you should concentrate on the data preprocessing, which I suspect has not been done yet.
Many thanks, Gent, for all your comments and suggestions! I took some time to think about what you mentioned. Please see the following:
- I am familiar with the concepts. I just was not sure whether they can also be used to debug the neural network in the same way Prof. Andrew explains here (https://www.coursera.org/lecture/deep-neural-network/basic-recipe-for-machine-learning-ZBkx4) about the error.
- The original dataset is quite small (~15000 points). I augmented the training set with many examples from other similar datasets; i.e., it now contains more difficult examples to learn from. I am using Keras, and therein I am setting shuffle=True.
- I checked the examples on Kaggle, but I did not see anything different from what I am doing. I have already derived additional columns (e.g., BMI), encoded the categorical variables (one-hot encoding), scaled the data (MinMaxScaler), and imputed the missing values (KNN imputer); a sketch of this pipeline follows the list.
- To test my implementation, I trained on a subset of the data of size 1000 and used the same subset also for validation. My training accuracy and ROC AUC are both ~99.96%, i.e., my NN overfits the training set. Similarly, the validation accuracy and ROC AUC are ~99.94%. So I guess that this at least confirms that the implementation is correct.
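In case it helps, this is roughly how my preprocessing is arranged (numeric_cols and categorical_cols are placeholders for my actual column lists, and the KNN neighbor count is just my current choice):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.impute import KNNImputer

# Impute missing values (KNN), scale the numeric columns (MinMaxScaler),
# and one-hot encode the categorical variables, as described above.
numeric_pipeline = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("scale", MinMaxScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X_train = preprocess.fit_transform(X_train_raw)  # fit only on training data
X_dev = preprocess.transform(X_dev_raw)
```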
Could it be that this behavior occurs because the training set is just too difficult for the model to learn, and I simply need to train longer? Or do you still think that trying a bigger model would be more helpful?
Thank you very much once again! I am very grateful for your help.
Endrit
My concern would be that when you expanded the original dataset from 15k to 80k points, the distribution of the data might no longer be the same, unless you have already checked that (a simple check is sketched below). Also, why are you training and testing on just 1000 points instead of the entire dataset, divided appropriately into train/dev/test? Hypothetically speaking, the NN will fit the data it has been trained on, unless the data distribution is very strange, which should not be the case if the right features are used.
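One simple way to check this would be to compare per-feature statistics of the original and the added examples (df_original and df_added are hypothetical DataFrames holding the two sources):

```python
import pandas as pd

# Compare per-feature means of the two sources; large differences hint
# that the added examples come from a different distribution.
comparison = pd.DataFrame({
    "original_mean": df_original.mean(numeric_only=True),
    "added_mean": df_added.mean(numeric_only=True),
})
comparison["abs_diff"] = (comparison["original_mean"] - comparison["added_mean"]).abs()
print(comparison.sort_values("abs_diff", ascending=False))
```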
If you think the datasets and their subsets come from the same distribution and the features used are good ones, there should be no reason a NN could not learn from the training data and make good predictions afterwards. (You could use a bigger batch size, and the number of epochs could be increased, but if there is no visible improvement in learning there is no point in training longer; a way to stop automatically in that case is sketched below.)
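For example, in Keras an EarlyStopping callback can stop training once the dev-set loss stops improving (the batch size and epoch limit here are only illustrative, and model/X_train/y_train/X_dev/y_dev refer to your existing setup):

```python
from tensorflow import keras

# Stop when val_loss has not improved for 5 epochs and keep the best weights.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
model.fit(
    X_train, y_train,
    validation_data=(X_dev, y_dev),
    batch_size=64,    # a larger batch size, as suggested above
    epochs=200,       # an upper bound; early stopping cuts training short
    callbacks=[early_stop],
)
```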
If the data is right, then you should concentrate on the model and on training it longer. I think if you search you might find a pretrained model, or you can build a more complex one using batch normalization, dropout, residual connections, more layers, etc.; a rough sketch of such a model follows. The bottom line is that a NN is powerful enough to learn many dependencies and will give good predictions in similar scenarios.
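A rough sketch of what such a more complex model might look like (the layer widths and dropout rate are illustrative, not tuned to your data):

```python
from tensorflow import keras

# A deeper feedforward model with batch normalization and dropout.
model = keras.Sequential([
    keras.layers.Input(shape=(21,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[keras.metrics.AUC(name="roc_auc")],
)
```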
I wish you success; these are just my thoughts.