I have a question about the size of the training set. Take ECGs as an example: normal ECG graphs all look very similar. If I only have a few dozen of these images and ~10 images of abnormal ones, is that enough to train a CNN model? If not, can I just copy or augment them to artificially increase the training set (say, to 1000 samples)? My understanding is that it normally takes a large amount of training data to get high accuracy. Am I correct?
I recently did a few experiments to try to answer this question for myself and wrote about it here:
My admittedly unscientific approach suggests that somewhere on the order of 10^4 images is where there is enough data to dominate hyperparameter and architecture choices and not require augmentation. If you have only tens, hundreds, or even thousands of training images, I think you are going to have to do some extra work to get generalizable results. I am hoping others will weigh in, since this is something I am actively working on.
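To make the "copy or augment" idea from the question concrete, here is a minimal sketch of how a few dozen images could be expanded to a fixed number per class. It is not taken from the experiments mentioned above. The function names `augment_image` and `build_augmented_set`, the choice of perturbations (small shift along the time axis, mild amplitude scaling, Gaussian noise), and all parameter values are assumptions for illustration. It also assumes 2-D grayscale images scaled to [0, 1]. Flips are deliberately left out, since mirroring an ECG trace would change its meaning.

```python
import numpy as np


def augment_image(image, rng, max_shift=4, noise_std=0.02, scale_range=(0.9, 1.1)):
    """Return one randomly perturbed copy of a 2-D grayscale image in [0, 1].

    All perturbation choices here are illustrative assumptions.
    """
    # Small horizontal shift; np.roll wraps pixels around the edge.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(image, shift, axis=1)
    # Mild intensity scaling plus additive Gaussian noise.
    scale = rng.uniform(*scale_range)
    noisy = shifted * scale + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def build_augmented_set(images, labels, target_per_class, seed=0):
    """Create target_per_class augmented samples for every class.

    Oversampling each class to the same size also evens out the
    normal/abnormal imbalance described in the question.
    """
    rng = np.random.default_rng(seed)
    out_images, out_labels = [], []
    for cls in np.unique(labels):
        originals = images[labels == cls]
        for i in range(target_per_class):
            # Cycle through the originals so each one is reused evenly.
            source = originals[i % len(originals)]
            out_images.append(augment_image(source, rng))
            out_labels.append(cls)
    return np.stack(out_images), np.array(out_labels)


if __name__ == "__main__":
    # Stand-in data: 30 "normal" and 10 "abnormal" random 32x32 images.
    demo_rng = np.random.default_rng(1)
    demo_images = demo_rng.random((40, 32, 32))
    demo_labels = np.array([0] * 30 + [1] * 10)
    x, y = build_augmented_set(demo_images, demo_labels, target_per_class=500)
    print(x.shape, np.bincount(y))
```

Two caveats: exact copies add no new information, and even augmented copies are still variations of the same few originals, so they cannot stand in for scenarios that were never recorded. If you try this, hold out your validation images before augmenting, otherwise near-duplicates of the same original end up on both sides of the split.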
Hi @Yixuan_Li-Verdoold, great question. I would say more data is better, but the quality of the data matters a lot. The model needs enough examples of each possible case in order to learn them. In medical imaging it is difficult to create synthetic data, since you need to capture many different scenarios that might arise. Accuracy increases with more data because the model sees more scenarios in which it can identify patterns. At the moment there are teams working on generating synthetic data for medical applications, but there is still a long way to go. You can check some of their work here