I can make a few comments, but I should start with the disclaimer that the MobileNet material is all brand new in the “refresh” version of the courses released in April 2021, so I haven’t lived with it for very long yet. I went through it quickly back in April to be able to answer questions on the assignment, which doesn’t really deal with the internals of the MobileNet model and treats it as a black box, so I don’t really know much about the details of MobileNet itself. With that said, here are a few things I can think of that are relevant to your questions:
- The comment comparing the architecture to Residual Nets doesn’t really seem to help in understanding how things work here. If you compare the two architectures at the level of the diagrams, the skip connections seem to be the main point of comparison. In the MobileNet case, they skip between the (apparently linear) bottleneck layers. In the basic Residual Net blocks, there are no bottleneck layers and the skip connections either preserve or convolve the input to the output dimensions, skipping over a couple of pretty plain vanilla conv layers (see the block sketch after this list). So they are different, and I don’t see how one is the “inverse” of the other. That comparison doesn’t seem very helpful, but that probably just means I’m missing the point.
- A bottleneck is just standard English usage for some sort of pathway that shrinks down in size. E.g. you’re on a 4-lane road and it necks down to 3 lanes and then 2 lanes: that’s a “bottleneck”. Here they just mean that they are reducing the dimensionality from input to output. The interesting bit is that each full composite layer expands and then contracts. So intuitively they are taking a smaller input, doing something more complex to analyze it, and then distilling it back down to a lower dimension after just a few cheap convolution steps (there’s a shape walk-through after this list). Of course the “magic” here is that there are coefficients at every single layer and those are learned through back prop. So it either learns something interesting and useful or it doesn’t. They seem to say that (if properly configured) it can learn something useful with this architecture.
- The statement that non-linear layers lose information just seems ridiculous on its face from a mathematical standpoint. Maybe some non-linear layers do, but not an arbitrary non-linear function. So (with some trepidation) I read the abstract of the paper, skimmed just the section on the linear layers, and I think I see what they are talking about. All they discuss in that section is ReLU. So, yes, of course ReLU destroys information: everything above 0 is passed through as is, and everything less than zero is zeroed out and has no effect on the output (there’s a small numeric example after this list). So be careful which non-linear function you pick as your activation. If you used tanh, sigmoid or swish, you would not have this problem. But the whole point of MobileNet is to be cheap to run in prediction mode in terms of both compute and memory, so maybe Leaky ReLU is really the only possible choice that avoids destroying information while still providing non-linearity at minimal compute cost. They don’t mention that, though. But this is Science, right? They ran the experiments and showed that it works better with just pure linearity in the bottleneck layers. That’s the way Science works, so let’s go with that. Mind you, I spent a total of < 5 minutes with the paper, so I can’t claim to actually understand any of it.
- That’s a pretty general question, not specific to MobileNet. The way general ConvNets work, I think the way to look at the output of each conv layer is that each filter is distilling some kind of information from the geometry and values of the pixels in its input and passing that along, perhaps at a reduced dimension. So I think it’s your first statement: you can think of each channel as containing distilled information about some particular feature of the input (there’s a quick shape example after this list). But there are lots of ways to use ConvNets and they don’t all just result in a classification output (e.g. stay tuned for U-Net and YOLO in the upcoming weeks).
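To make the “where do the skips go” point concrete, here’s a rough Keras-style sketch of the two block shapes. This is my own toy version, not the actual code from either paper: the filter counts and expansion factor are made up, batch norm is left out, and the real MobileNetV2 uses ReLU6 rather than plain ReLU.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Classic Residual Net block: the skip jumps over two full-width convs
    # with ReLU inside, and the tensor stays "wide" all the way through.
    # (Assumes x already has `filters` channels so the Add works.)
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def inverted_residual_block(x, bottleneck_filters=24, expansion=6):
    # MobileNetV2-style block: expand to a wide tensor, do a cheap depthwise
    # conv there, then project back down with a *linear* 1x1 conv.
    # The skip connects the two narrow bottleneck ends, not the wide middle.
    # (Again assumes x already has `bottleneck_filters` channels.)
    shortcut = x
    y = layers.Conv2D(bottleneck_filters * expansion, 1, activation="relu")(x)  # expand
    y = layers.DepthwiseConv2D(3, padding="same", activation="relu")(y)         # depthwise filter
    y = layers.Conv2D(bottleneck_filters, 1)(y)                                  # linear projection
    return layers.Add()([shortcut, y])

x = tf.random.normal((1, 56, 56, 24))
print(inverted_residual_block(x).shape)  # (1, 56, 56, 24): same narrow shape in and out
```

The only point of the sketch is where the Add lands: over the wide layers in the residual case, between the narrow bottleneck layers in the MobileNet case.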
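Here’s the expand-then-contract shape walk-through from the bottleneck point above. It’s just arithmetic with illustrative channel counts, not the actual MobileNetV2 configuration, and the parameter counts ignore biases and batch norm.

```python
c_in, expansion, c_out = 24, 6, 24       # made-up channel counts

expanded  = c_in * expansion             # 1x1 "expansion" conv:  24 -> 144 channels
filtered  = expanded                     # 3x3 depthwise conv:    stays at 144 (one filter per channel)
projected = c_out                        # 1x1 linear projection: 144 -> 24 channels

print(f"{c_in} -> {expanded} -> {filtered} -> {projected}")
# 24 -> 144 -> 144 -> 24: widen, analyze, then distill back down to the bottleneck width

# Rough parameter count of the wide middle step:
depthwise_params = 3 * 3 * expanded               # one 3x3 filter per channel
standard_params  = 3 * 3 * expanded * expanded    # a full 3x3 conv at the same width
print(depthwise_params, "vs", standard_params)    # 1296 vs 186624: why the wide step stays cheap
```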
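And here’s the tiny numeric version of the ReLU point. This is my own toy example, nothing to do with the actual experiments in the paper.

```python
import numpy as np

x = np.array([2.0, -3.0, 0.5, -1.5])    # pretend these are bottleneck activations

relu_out = np.maximum(x, 0.0)           # ReLU: everything below zero is wiped out
linear_out = 0.5 * x                    # an invertible linear map keeps everything

print(relu_out)    # [2.  0.  0.5 0. ]            -> the -3.0 and -1.5 are gone for good
print(linear_out)  # [ 1.   -1.5   0.25 -0.75]    -> fully recoverable: x = linear_out / 0.5
```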
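For the channels-as-features question, a quick shape check makes the idea concrete (arbitrary toy numbers, 32 filters is just a random choice):

```python
import tensorflow as tf

x = tf.random.normal((1, 64, 64, 3))                   # one 64x64 RGB image
conv = tf.keras.layers.Conv2D(32, 3, padding="same")   # 32 learned filters
y = conv(x)

print(y.shape)  # (1, 64, 64, 32): one 64x64 map per filter, so each of the 32
                # channels is that filter's response to its particular learned
                # feature at every spatial position
```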
That may be more words than are justified by the amount of information in the above, but that’s what I’ve got this morning.