Andrew says that the bottleneck layer reduces memory requirements. But it seems the intermediate activations, which are larger, have to be saved too. So how exactly does it reduce memory requirements?
The intermediate activations are not saved; instead, the bottleneck block is treated as a single operation. The authors of the MobileNetV2 paper write the following:
“if we treat a bottleneck residual block as a single operation (and treat inner convolution as a disposable tensor), the total amount of memory would be dominated by the size of bottleneck tensors, rather than the size of tensors that are internal to bottleneck (and much larger).”
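To make the quote concrete, here is a small sketch of the shapes flowing through one inverted residual block (1x1 expansion, 3x3 depthwise, 1x1 linear projection). The function name and the specific dimensions are illustrative, not from the paper; only the structure and the expansion factor t follow MobileNetV2's design.

```python
def inverted_residual_shapes(h, w, c_in, c_out, t=6, stride=1):
    """Return the (H, W, C) shapes of the tensors in one inverted
    residual block: input, expanded, after-depthwise, output.
    Illustrative sketch; t is the expansion factor from MobileNetV2."""
    inp = (h, w, c_in)                          # bottleneck input
    expanded = (h, w, c_in * t)                 # after 1x1 expansion conv
    dw = (h // stride, w // stride, c_in * t)   # after 3x3 depthwise conv
    out = (h // stride, w // stride, c_out)     # after 1x1 linear projection
    return inp, expanded, dw, out

# Example: a 56x56x24 stage. The inner tensors are t=6 times wider
# in channels than the bottleneck input/output.
for shape in inverted_residual_shapes(56, 56, 24, 24):
    print(shape)
```

The two inner tensors (the "disposable" ones in the quote) carry `c_in * t` channels, which is why they are much larger than the bottleneck tensors at the block's boundary.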
But is it dominated by the bottleneck layer at training time?
Here’s my two cents.
As the implementation notes state about memory use by a bottleneck layer:
“the amount of memory is simply the maximum total size of combined inputs and outputs across all operations.”
So the inputs and outputs of the bottleneck block as a whole determine the memory load during its computation. Whether this dominates during training depends on the specific configuration of the network and on how training is organized: backpropagation normally requires the inner activations to be kept for the backward pass (unless they are recomputed), so the quoted analysis applies most directly at inference time. Table 2 in the paper lists the layer sequence, which includes many bottleneck blocks, so these are likely to have a large impact. But the conv2d layer that follows them has a larger combined input and output volume that memory must also accommodate, so the answer may depend on how the training process is set up.
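A quick back-of-the-envelope comparison shows the difference between treating the block as a single operation and materializing the expanded inner tensor. The dimensions below are illustrative (a 56x56x24 stage with expansion factor t=6), not a claim about any specific row of Table 2.

```python
# Element counts for one inverted residual block at inference.
h, w, c = 56, 56, 24   # illustrative bottleneck spatial size and channels
t = 6                  # expansion factor

# Treating the block as one op: only its input and output live in memory.
bottleneck_io = h * w * c * 2   # input + output elements

# Materializing the expanded inner tensor instead:
inner = h * w * c * t           # elements of one inner activation

print(bottleneck_io)  # 150528
print(inner)          # 451584 -- 3x the combined input + output
```

With t=6, each inner tensor alone is t/2 = 3 times the size of the input and output combined, which is the sense in which the bottleneck tensors, not the inner ones, dominate memory once the inner tensors are treated as disposable.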