Hello all, I have written my own model based on the brain MRI segmentation model to go through the complete BraTS2020 dataset. My kernel keeps crashing after 3 training examples. I made the sampling volume really small, and it still happens. I'm training it on a Tesla V100 with 32 GB; system RAM is 256 GB. How can I get more help regarding this? Can I post my Jupyter notebook here? @Mubsi @nakamura @canxkoz @andres920310
edit2: I have tried training on just 3 images, and it still crashes.
It might be crashing because of the size of the dataset, I guess. These frameworks (and not only them, the OS itself) buffer data in RAM, and when the data is too large and some operations run faster than others, a crash can occur.
Just my thoughts here.
You could also look into TFX (TensorFlow Extended) for building ML data pipelines; it is designed to help with processing large amounts of data.
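TFX is a full pipeline framework, but the core idea that matters here is simpler: stream the data in small batches instead of materializing the whole dataset in RAM. A minimal sketch of that pattern with a plain Python generator (the file layout and `np.load` format are assumptions for illustration, not from the original post):

```python
import numpy as np

def volume_batches(paths, batch_size=2):
    """Yield small batches of volumes, loading one file at a time
    instead of holding the whole dataset in memory."""
    batch = []
    for path in paths:
        vol = np.load(path)          # load a single volume from disk
        batch.append(vol)
        if len(batch) == batch_size:
            yield np.stack(batch)    # (batch_size, ...) array
            batch = []
    if batch:                        # flush the final partial batch
        yield np.stack(batch)
```

A `tf.data.Dataset.from_generator` wrapper around something like this (plus `prefetch`) gives you the same effect inside a TensorFlow training loop, with only one batch resident in memory at a time.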
Hi @Jairaj_Mathur! I think the problem is that you may be processing whole images. One problem that arises with 3D images (such as MRIs) is that the required memory basically explodes. When you are working with convolutional neural networks you have to load into memory/GPU not only the images but also the activation maps of each layer in your network. That's why it is crashing even with just a few images.
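A quick back-of-envelope calculation shows why. Assuming the standard BraTS volume shape of 240×240×155 voxels with 4 MRI modalities stored as float32 (the 32-channel first conv layer below is just an illustrative assumption):

```python
# Back-of-envelope memory estimate for one BraTS-sized 3D sample.
H, W, D = 240, 240, 155        # voxels per volume (standard BraTS shape)
modalities = 4                 # T1, T1ce, T2, FLAIR
bytes_per_float32 = 4

voxels = H * W * D             # 8,928,000 voxels
input_mb = voxels * modalities * bytes_per_float32 / 1e6
print(f"input volume: {input_mb:.0f} MB")            # ~143 MB

# One full-resolution activation map with 32 feature channels:
channels = 32
act_gb = voxels * channels * bytes_per_float32 / 1e9
print(f"32-channel activation map: {act_gb:.2f} GB")  # ~1.14 GB
```

Every conv layer keeps its own activation map around for backpropagation, so just a handful of full-resolution layers can exceed a 32 GB GPU even at batch size 1.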
Something that is very common when training on medical images to address this issue is to process batches of image patches rather than whole images. For instance, nnU-Net (which was for a very long time the SOTA for medical image segmentation) fixes a patch size at training (see the paper), and at inference the patch predictions are stitched together to produce the full segmentation mask. Batch sizes are also smaller than for natural images. I have worked with 16 GB RAM and a 12 GB GPU and was able to train on a similar dataset (LNDb). Larger resources will mainly just speed up your training time.
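The patch-sampling idea can be sketched in a few lines. nnU-Net's actual sampler is more elaborate (it oversamples foreground regions, for example), so this plain random crop is only an illustration, with the shapes and patch size being assumptions:

```python
import numpy as np

def random_patch(volume, mask, patch_size=(128, 128, 128), rng=None):
    """Crop the same random 3D patch from an image volume and its
    segmentation mask. `volume` is (C, H, W, D); `mask` is (H, W, D)."""
    rng = rng or np.random.default_rng()
    _, H, W, D = volume.shape
    ph, pw, pd = patch_size
    # pick a random corner so the patch fits entirely inside the volume
    h = rng.integers(0, H - ph + 1)
    w = rng.integers(0, W - pw + 1)
    d = rng.integers(0, D - pd + 1)
    return (volume[:, h:h+ph, w:w+pw, d:d+pd],
            mask[h:h+ph, w:w+pw, d:d+pd])
```

Training then sees only one 128³ patch per sample instead of the full 240×240×155 volume, which cuts activation memory by several times; at inference you slide overlapping patches across the volume and stitch the predictions back together.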
I hope this is helpful.