Let’s say I’m running 10,000 examples in mini-batches of 100 through a forward pass in a deep network. In this configuration, what would be the unit of parallelization?
Will each mini-batch be run in parallel?
Will each example within a mini-batch also be run in parallel?
My general intuition would say that both are true. That is, each mini-batch is run on its own, and each example would also be run in parallel, since most loss functions decompose additively over the per-example losses.
However, one exception that breaks the parallelization argument for the second question is the introduction of Batch Normalization, since it computes statistics across the examples in a batch.
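To make that last point concrete, here is a minimal NumPy sketch (a toy example of my own, not framework code) of why Batch Normalization couples the examples within a batch:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature using statistics computed over the whole batch.
    mean = x.mean(axis=0)   # per-feature mean over the batch dimension
    var = x.var(axis=0)     # per-feature variance over the batch dimension
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch_a = rng.normal(size=(100, 4))   # a mini-batch of 100 examples, 4 features
batch_b = batch_a.copy()
batch_b[1:] += 10.0                   # change every example except the first

out_a = batch_norm(batch_a)
out_b = batch_norm(batch_b)

# The first example is identical in both batches, yet its normalized value differs,
# because the batch statistics changed. That is the coupling across examples.
print(np.allclose(out_a[0], out_b[0]))   # False
```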
Yes, modern frameworks all support both vectorization and parallelism. Your best bet is to start from the documentation of your “platform of choice”. We use TF here, of course. In just a quick search for “parallelism” on the TF documentation site, one finds:
That article links to this page, which discusses things at the level of GPUs.
I’m sure you can find more. If you are using PyTorch or some other platform, please do the equivalent search.
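For illustration, here is a minimal sketch of the kind of data parallelism those pages discuss, using TF’s standard tf.distribute API (the model and dataset here are placeholders of my own, not taken from the linked docs):

```python
import tensorflow as tf

# Data parallelism across the visible GPUs: each replica gets a slice of every
# global batch, per-replica gradients are aggregated, and one synchronized
# parameter update is applied.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(train_dataset, epochs=5)  # each global batch is split across the replicas
```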
To address your specific questions:
My guess is that this does not happen in general. The point of minibatch mode is that you want more frequent updates to the parameters, so that the learning happens more quickly. So you want to finish the first minibatch and then do the “update parameters” step resulting from that before proceeding to the second minibatch.
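Here is a minimal sketch of that sequential pattern (toy model and data of my own, sized to match the 10,000 examples / batches of 100 in the question):

```python
import tensorflow as tf

# Mini-batches are processed sequentially: each update must finish before the
# next batch is run, so later batches see the already-updated parameters.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.keras.losses.MeanSquaredError()

x = tf.random.normal((10_000, 8))
y = tf.random.normal((10_000, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(100)  # 100 mini-batches

for batch_x, batch_y in dataset:
    with tf.GradientTape() as tape:
        loss = loss_fn(batch_y, model(batch_x))
    grads = tape.gradient(loss, model.trainable_variables)
    # The parameter update for this batch happens here, before the next batch starts.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```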
Yes, within each minibatch, but that happens through vectorization rather than multithreading. I guess you could say that vectorization is just the most basic form of parallelism. Then you can layer multithreading and distributed processing as further degrees of parallelization on top of the basic GPU vectorization.
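As a concrete (toy) illustration of what vectorization over a mini-batch looks like: the whole batch of 100 examples goes through a layer as a single matrix multiply, rather than as 100 separate per-example passes.

```python
import tensorflow as tf

batch = tf.random.normal((100, 784))   # one mini-batch: 100 examples, 784 features
W = tf.random.normal((784, 128))       # weights of a single dense layer
b = tf.zeros((128,))

# One matmul computes the layer's activations for all 100 examples at once;
# the within-batch parallelism is vectorization, not 100 separate threads.
activations = tf.nn.relu(batch @ W + b)
print(activations.shape)               # (100, 128)
```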
This thread reminded me that I recently reread the Krizhevsky, Sutskever, and Hinton paper “ImageNet Classification with Deep Convolutional Neural Networks”, aka AlexNet, in which they detail their approach to training on two GPUs circa 2012.
The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU.
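A rough sketch of that idea in modern TF terms (my paraphrase of the scheme, not the authors’ code; it assumes two visible devices named "/GPU:0" and "/GPU:1"):

```python
import tensorflow as tf

# Toy model parallelism in the spirit of the AlexNet trick: half of the kernels of
# a convolutional layer live on each GPU.
x = tf.random.normal((1, 32, 32, 3))           # one input image

with tf.device("/GPU:0"):
    w0 = tf.random.normal((3, 3, 3, 48))       # first half of the kernels
    maps0 = tf.nn.conv2d(x, w0, strides=1, padding="SAME")

with tf.device("/GPU:1"):
    w1 = tf.random.normal((3, 3, 3, 48))       # second half of the kernels
    maps1 = tf.nn.conv2d(x, w1, strides=1, padding="SAME")

# A "communicating" layer concatenates both halves so the next layer sees all
# kernel maps; a non-communicating layer feeds each half only to its own GPU.
all_maps = tf.concat([maps0, maps1], axis=-1)  # shape (1, 32, 32, 96)
```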
To date, that paper has some 150,000+ citations.
I am under the impression that prior research on CNNs had used only a single GPU.