Hello.
I have about 70 images and have been asked to develop several deep learning models for a binary classification problem (e.g. cat vs. no cat).
So I have 27 no-cat images with label 0 and 44 cat images with label 1.
I have used train_test_split() to split these into train and test datasets, with a test size of 20%:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(all_images, all_labels_binary, random_state=0, test_size = .20)
In my test dataset, I have only three values with label 0 and 12 values with label 1. However, I can use the stratify = … parameter to maintain the proportion of 0 vs 1 labels.
However, I am now worried that if I go on to train a CNN model, the datasets are simply not large enough to generate any meaningful results.
If I am to obtain more images, how many would you recommend? I’m getting these images from my camera, as it is a bring your own data project.
Thank you.
Stratification is correct.
What’s the baseline when using transfer learning, taking class imbalance into account?
What do you mean? I have used
X_train_images, X_test_images, y_train_labels, y_test_labels = train_test_split(all_images, all_labels_binary, stratify= all_labels_binary, random_state=0, train_size = .80)
to maintain the same proportion in the test set.
Thank you.
I now actually have a different problem: it runs out of memory with this model architecture (the flatten() output is so large that the following Dense layer has too many parameters). Maybe I should post it as a different question…
My original question remains, though: should I be using more images?
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 224, 224, 32)      2432
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 224, 224, 128)     36992
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 112, 112, 128)     0
_________________________________________________________________
flatten (Flatten)            (None, 1605632)           0
_________________________________________________________________
dense (Dense)                (None, 128)               205521024
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129
=================================================================
Total params: 205,560,577
Trainable params: 205,560,577
Non-trainable params: 0
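On the out-of-memory issue: nearly all of those ~205M parameters sit in the Dense layer that follows flatten(), because it receives 112 × 112 × 128 = 1,605,632 inputs. One common fix is to add more pooling or to replace Flatten with GlobalAveragePooling2D. As a rough, untested sketch of your architecture with that change (assuming TensorFlow/Keras, 224 × 224 × 3 inputs, and the kernel sizes implied by your parameter counts):

from tensorflow.keras import layers, models

# Same conv stack as before, but GlobalAveragePooling2D collapses each of the 128
# feature maps to a single value, so the Dense layer sees 128 inputs, not 1,605,632.
model = models.Sequential([
    layers.Conv2D(32, (5, 5), activation='relu', padding='same', input_shape=(224, 224, 3)),
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    layers.GlobalAveragePooling2D(),   # replaces Flatten
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

That brings the model down to roughly 56k parameters instead of 205 million, which should fit in memory and is also a better match for ~70 images.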
Regarding stratification, I meant you were on the right track.
Deep NNs always benefit from more good-quality data. If it’s easy to collect more data and you have access to a lot of computational resources, nothing bad will come of it.
It’s impossible for me to give you an exact number for how much data to collect.
I recommend you take the Deep Learning Specialization to get a better handle on your task, especially the material on transfer learning.
I have done the week on Object Detection and CNNs from this specialisation, and am still not sure about my practical example.
My images, taken with a thermal camera, show either one or two men sitting still on a chair vs. an empty chair (there is a very clear outline of a man when one is present).
I get perfect validation accuracy with the CNN model architecture shown above and with two variations of it that add batch_normalisation and dropout layers.
I am not sure whether this is just because of the limited number of images.
If your train/test distributions match real-world inputs and your model performs well on the test set as well, you are good to go. If you have only a few images, I recommend using transfer learning rather than training a model from scratch.
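A minimal sketch of what that could look like (assuming Keras, 224 × 224 RGB inputs, and MobileNetV2 as the pretrained base; any ImageNet model would work similarly):

from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# Pretrained convolutional base with its ImageNet weights frozen, so only the small
# classification head on top is trained on your ~70 images.
# (MobileNetV2 expects inputs scaled with tf.keras.applications.mobilenet_v2.preprocess_input.)
base = MobileNetV2(input_shape=(224, 224, 3), include_top=False, weights='imagenet')
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Since only the final Dense layer is trained, this is far less likely to overfit a small dataset than training everything from scratch.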
Yes, they’re real-world inputs from a real camera. Thank you!
You’re welcome. One little detail, since I’m unsure whether we’re referring to the same thing: for training/validation, I recommend using RepeatedStratifiedKFold since you have so few data points.
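A minimal sketch of what I mean, reusing your variable names (assuming all_images and all_labels_binary are NumPy arrays):

from sklearn.model_selection import RepeatedStratifiedKFold

# 5 stratified folds, repeated 3 times with different shuffles = 15 train/validation splits
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
for train_idx, val_idx in rskf.split(all_images, all_labels_binary):
    X_tr, X_val = all_images[train_idx], all_images[val_idx]
    y_tr, y_val = all_labels_binary[train_idx], all_labels_binary[val_idx]
    # build a fresh model for each fold, fit it on (X_tr, y_tr), evaluate on (X_val, y_val),
    # then average the validation scores over all 15 splits

With so few images, averaging over many stratified splits gives a much more reliable estimate of validation performance than a single small hold-out set.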