Distributed training - how exactly is model "split up" while using model parallelism?

In the video, the instructor mentions that, for model parallelism, the model is split across different nodes/compute instances in the cluster. But I can’t wrap my head around how that is implemented. For example, say I have a CNN with 5 convolutional layers, 2 FC layers, and a final classification layer. How can you “split” this architecture across different compute instances?

Do we provide the same data to each of these “split” architectures?


Hi @bagyaboy,

The AlexNet model for the ImageNet challenge was designed by Alex Krizhevsky. At the time, they were using GPUs (two GTX 580s with 3 GB of memory each) far less powerful, with far less memory, than a 1080. AlexNet has a huge number of parameters, mostly due to the final fully connected layers, so the entire model couldn’t fit on a single GPU. The hack was to use grouped convolutions. For a brief explanation of how they implemented model parallelism using grouped convolutions, you can check this article.
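To see why the fully connected layers dominate, here is a quick back-of-the-envelope parameter count for AlexNet's three FC layers (weights only, biases omitted). The layer sizes below are the ones from the AlexNet paper:

```python
# AlexNet's last conv output is a 6x6x256 feature map, which is
# flattened and fed into three fully connected layers.
fc1 = 6 * 6 * 256 * 4096   # flattened conv output -> 4096 units
fc2 = 4096 * 4096          # 4096 -> 4096
fc3 = 4096 * 1000          # 4096 -> 1000 ImageNet classes

total_fc = fc1 + fc2 + fc3
print(fc1)       # 37748736  (~37.7M in the first FC layer alone)
print(total_fc)  # 58621952  (~58.6M of AlexNet's ~61M total)
```

So roughly 96% of the model's parameters live in the FC layers, which is exactly the memory pressure that forced the two-GPU split.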
This is not the only way to achieve model parallelism, just one of them. I believe it’s still an active area of research.
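To answer the "do we provide the same data to each split?" part of the question: in layer-wise model parallelism, no. The data enters the first partition, and only the intermediate activations are handed to the next device. Here is a minimal, framework-free sketch of that idea (the toy layers and weights are made up for illustration; in a real setup each partition would live on its own GPU):

```python
# Layer-wise model parallelism, simulated in plain Python: each "device"
# holds only its own slice of the layers, and activations flow device to
# device instead of every device seeing a copy of the input.

def relu(x):
    return [max(0.0, v) for v in x]

def linear(weights):
    # weights is a list of rows; returns a layer function x -> Wx
    def layer(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return layer

# A toy 4-layer "model", partitioned across two simulated devices.
device0 = [linear([[1.0, 2.0], [3.0, 4.0]]), relu]    # first half of the net
device1 = [linear([[1.0, -1.0], [0.5, 0.5]]), relu]   # second half of the net

def forward(x, partitions):
    for device_layers in partitions:   # in practice a device-to-device
        for layer in device_layers:    # transfer (GPU0 -> GPU1) happens here
            x = layer(x)
    return x

print(forward([1.0, 1.0], [device0, device1]))  # [0.0, 5.0]
```

Note that with this naive split, device 1 sits idle while device 0 computes (and vice versa); real systems pipeline micro-batches to keep both busy.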

Best Regards,
A. Sriharsha

@sriharsha0806 Thanks for the response. So, if I want to use a more recent or more complicated model like EfficientNet or BERT, which (I assume) will not fit in memory, do I have to implement such ideas within the architecture myself? Or does SageMaker take care of this for me?

Can you please point me to some ideas for this?

Hi @bagyaboy,

You can check this documentation on how to implement model parallelism using SageMaker. As I mentioned before, it is an active area of research; you can start with this paper. Each method has its own pros and cons, but newer methods often perform better than the previous ones. A good example is PyTorch Lightning’s model-parallel training, the first one I came across that uses the sharding concept, which reduces the memory overhead per GPU.
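To give a feel for what sharding means: instead of every worker holding a full replica of all parameters (and optimizer state), each of the N workers owns roughly a 1/N shard, so per-GPU memory drops accordingly. A hypothetical round-robin sketch (the parameter names and `shard` helper are made up for illustration, not a real library API):

```python
# Sharding idea in miniature: assign each parameter tensor to exactly
# one worker, so each worker stores ~1/N of the model state instead of
# a full copy.

def shard(params, num_workers):
    # Round-robin assignment of parameter tensors to worker ranks.
    shards = [[] for _ in range(num_workers)]
    for i, p in enumerate(params):
        shards[i % num_workers].append(p)
    return shards

params = [f"layer{i}.weight" for i in range(8)]
shards = shard(params, num_workers=4)
for rank, s in enumerate(shards):
    print(rank, s)   # each rank stores 2 of the 8 tensors
```

During training, workers then gather the shards they need on the fly (and release them afterwards), trading extra communication for the memory savings.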

Best Regards,
A. Sriharsha
