Kernel keeps crashing when training on full dataset on local machine

Hello all, I have written my own model based on the brain MRI segmentation model to go through the complete BraTS 2020 dataset. My kernel keeps crashing after 3 training examples. I made the sampling volume really small, and it still happens. I'm training it on a Tesla V100 with 32 GB of GPU memory, and system RAM is 256 GB. How can I get more help regarding this? Can I put my Jupyter notebook here? @Mubsi @nakamura @canxkoz @andres920310

Edit 2: I have tried to train on just 3 images, and it still crashes.

It might be crashing because of the size of the dataset, I guess. These frameworks (and the OS itself) buffer data in RAM, and when the data is too large, or some operations run faster than others and the buffers fill up, the process can crash.

Just my thoughts here.

Maybe look into TFX (TensorFlow Extended) for building ML pipelines; it could be helpful for processing large amounts of data.
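Even without the full TFX stack, a streaming `tf.data` input pipeline keeps only one batch in memory at a time instead of loading the whole dataset up front. A minimal sketch, assuming the volumes have already been preprocessed and saved as individual `.npy` files (the `data/images` and `data/masks` paths below are placeholders):

```python
import numpy as np
import tensorflow as tf
from pathlib import Path

# Hypothetical layout: one preprocessed .npy volume and mask per case.
image_paths = [str(p) for p in sorted(Path("data/images").glob("*.npy"))]
mask_paths = [str(p) for p in sorted(Path("data/masks").glob("*.npy"))]

def load_case(img_path, msk_path):
    # Runs as a Python function, so only one case is held in RAM at a time.
    img = np.load(img_path.decode()).astype(np.float32)
    msk = np.load(msk_path.decode()).astype(np.float32)
    return img, msk

def tf_load(img_path, msk_path):
    return tf.numpy_function(load_case, [img_path, msk_path],
                             [tf.float32, tf.float32])

dataset = (
    tf.data.Dataset.from_tensor_slices((image_paths, mask_paths))
    .shuffle(len(image_paths))
    .map(tf_load, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(1)        # small batch: 3D volumes are large
    .prefetch(1)     # overlap loading with training
)
```

You can then pass `dataset` straight to `model.fit(...)`, and data loading never needs the whole dataset in memory.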


Hi @Jairaj_Mathur! I think the problem is that you may be processing whole images. One problem with 3D images (such as MRIs) is that the required memory basically explodes: when you train a convolutional neural network, you have to hold in memory/GPU not only the images but also the activation maps of every layer in your network. That is why it crashes even with only a few images.
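A rough back-of-the-envelope calculation shows the scale, assuming BraTS-sized volumes (240×240×155 voxels, 4 modalities, float32) and a hypothetical first conv layer with 32 filters at full resolution:

```python
voxels = 240 * 240 * 155          # one BraTS volume
bytes_per_float = 4               # float32

input_mb = voxels * 4 * bytes_per_float / 1e6         # 4 modalities as channels
first_layer_mb = voxels * 32 * bytes_per_float / 1e6  # 32 feature maps at full resolution

print(f"input volume:             ~{input_mb:.0f} MB")        # ~143 MB
print(f"one conv layer's outputs: ~{first_layer_mb:.0f} MB")  # ~1143 MB
```

And backprop keeps the activations of every layer plus gradients, so even a handful of whole volumes can exhaust a 32 GB GPU.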

Something very common when training on medical images, to address this issue, is to process batches of image patches rather than whole images. For instance, nnU-Net (which was for a very long time the SOTA for medical image segmentation) fixes a patch size at training time (see the paper), and at inference the patch predictions are stitched together to form the full segmentation mask. The batch sizes are also smaller than for natural images. I have worked with 16 GB of RAM and a 12 GB GPU and was able to train on a similar dataset (LNDb). Larger resources will mainly just speed up your training.
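A minimal sketch of patch-based sampling in plain NumPy; `volume` and `mask` are assumed to already be loaded as arrays, and the 128³ patch size is just an example:

```python
import numpy as np

def random_patch(volume, mask, patch_size=(128, 128, 128)):
    """Crop a random patch from a (C, D, H, W) volume and its (D, H, W) mask."""
    _, d, h, w = volume.shape
    pd, ph, pw = patch_size
    z = np.random.randint(0, d - pd + 1)
    y = np.random.randint(0, h - ph + 1)
    x = np.random.randint(0, w - pw + 1)
    vol_patch = volume[:, z:z+pd, y:y+ph, x:x+pw]
    msk_patch = mask[z:z+pd, y:y+ph, x:x+pw]
    return vol_patch, msk_patch

# Feed a small batch of patches to the network instead of whole volumes, e.g.:
# patches = [random_patch(volume, mask) for _ in range(2)]
# x_batch = np.stack([p[0] for p in patches])
# y_batch = np.stack([p[1] for p in patches])
```

With patches, the per-step memory cost is fixed by the patch size and batch size, not by the full volume dimensions.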

I hope this is helpful :smile:
