3D CNN optimal architecture

Hi - I work with MRI data, so I am interested in 3D CNNs. I’m wondering whether anyone has properly optimized the network architecture for 3D data when (unlike in 2D) the amount of training data is relatively limited.

For example, in Volume Segmentation with the 3D U-net, the number of channels goes from 3 to 32 to 64 before max pooling. I might have expected that in 3D space we would need more channels to represent (for example) the greater number of possible edge directions.

Similarly, the choice of patch size (160x160x16 in the C1W3 example) seems quite arbitrary. Given that the input data is 240x240x155, this is essentially only taking random sets of 16 slices.

Another observation for brain data is that we can transform brains into a fairly standard space using simple affine transformations (as well as more sophisticated non-linear warping). Can we use this standardization to help the NN? Unlike many CNN applications, we don’t necessarily need (or want) translational invariance.

I’d be delighted to hear from anyone with suggestions, and maybe even collaborate on some analyses!


Hi @Richard_Watts! You raise some interesting points in this thread. I will try to answer them from my experience working with vision problems in natural and medical images. As a disclaimer, many practices in the deep learning community come from experimenting and getting good results, with the theories explaining those results arriving afterwards. So some of the answers I’ll give you are mainly based on why these approaches are used (as far as I know).

  • The number of channels is a way to increase the expressiveness of your model. Channel counts are typically powers of 2 (for computational performance) and are doubled each time the spatial size is reduced (which is also usually done by a factor of 2). This idea comes from VGG.

  • 3D images occupy much more memory than natural images (given that extra dimension). That is why we process them in patches rather than as the full image. When working with patches, we must be extra cautious, as we can introduce artifacts at the patch boundaries. This is why, at evaluation, we use overlapping patches and stitch the outputs to reconstruct the full image. Regarding the size of the patch, it is arbitrary. However, when slicing images (or generally storing arrays), it is recommended to use dimensions that are powers of 2. Even if we are taking random patches of arbitrary size, as long as the patches are big enough to capture the phenomena we are interested in (e.g., anatomical structures of organs or tumors), they should work fine.

  • I have seen that using affine transformations to give the images uniform spacing works well. This way, the filters learn on the same scale for all training observations. You also mention the appropriateness of some transformations. In the medical field, I think image transformations should be chosen more carefully so that they give plausible training examples.
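To make the overlapping-patch evaluation from the second bullet a bit more concrete, here is a minimal NumPy sketch. It is only an illustration under some assumptions: `patch_starts` and `predict_overlapping` are hypothetical helper names (not from the course code), `predict_fn` stands in for whatever model you use and is assumed to return an array of the same shape as its input patch, and overlaps are combined by plain averaging, which is just one of several possible blending choices.

```python
import numpy as np


def patch_starts(size, patch, stride):
    """Start indices along one axis; the last patch is placed flush with the end."""
    if patch >= size:
        return [0]
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)
    return starts


def predict_overlapping(volume, predict_fn, patch_shape, stride):
    """Apply predict_fn to overlapping 3D patches and average the overlapping outputs.

    Assumes stride <= patch size on every axis, so that every voxel is covered.
    """
    if any(st > p for st, p in zip(stride, patch_shape)):
        raise ValueError("stride must not exceed the patch size on any axis")

    out = np.zeros(volume.shape, dtype=np.float64)
    counts = np.zeros(volume.shape, dtype=np.float64)
    starts = [patch_starts(s, p, st)
              for s, p, st in zip(volume.shape, patch_shape, stride)]

    for x in starts[0]:
        for y in starts[1]:
            for z in starts[2]:
                window = (slice(x, x + patch_shape[0]),
                          slice(y, y + patch_shape[1]),
                          slice(z, z + patch_shape[2]))
                out[window] += predict_fn(volume[window])
                counts[window] += 1.0

    # Every voxel was visited at least once, so the division is safe.
    return out / counts
```

For a 240x240x155 volume with 160x160x16 patches, you could call something like `predict_overlapping(volume, model_fn, (160, 160, 16), (80, 80, 8))`, where `model_fn` is your own wrapper around the network. A smaller stride means more overlap and smoother seams, at the cost of more forward passes.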

I hope my answer helps you clear things up :smile:

Regarding the limited training data available in medical images, maybe @mf.roa can shed some light.

Hi @Richard_Watts, regarding limited training data in 3D problems, which is particularly evident in medical data, you can use data augmentation techniques that consider only augmentations that make sense in the context you are working on. For example, in medical images, it does not make sense to implement mirroring techniques or any methods that corrupt the anatomical shape and location of the organs. However, you can use variations in intensity and gamma to augment this kind of data. Other approaches have been developed using generative models to create artificial data by learning the distribution and characteristics of existing real data. I hope it helps!
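As a rough illustration of the intensity and gamma augmentation mentioned above, here is a minimal NumPy sketch. The function name `augment_intensity` and the default ranges are assumptions chosen for illustration only, not recommended values. They would need tuning for your modality and normalization.

```python
import numpy as np


def augment_intensity(volume, rng,
                      gamma_range=(0.7, 1.5),
                      scale_range=(0.9, 1.1),
                      shift_range=(-0.1, 0.1)):
    """Randomly perturb intensities without changing anatomy or geometry.

    The ranges are illustrative assumptions, not tuned values.
    """
    lo, hi = float(volume.min()), float(volume.max())
    if hi == lo:
        # Constant volume: nothing meaningful to augment.
        return volume.copy()

    # Rescale to [0, 1] so the gamma curve is well defined.
    norm = (volume - lo) / (hi - lo)

    # Gamma correction: gamma < 1 brightens mid-tones, gamma > 1 darkens them.
    norm = norm ** rng.uniform(*gamma_range)

    # Small random contrast (scale) and brightness (shift) change.
    norm = norm * rng.uniform(*scale_range) + rng.uniform(*shift_range)
    norm = np.clip(norm, 0.0, 1.0)

    # Map back to the original intensity range.
    return norm * (hi - lo) + lo
```

You would typically call this on each training patch with a shared generator, e.g. `augment_intensity(patch, np.random.default_rng(42))`. Because only voxel values change, the segmentation labels stay valid as they are.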