Expansion and projection (pointwise convolution) in MobileNetV2

Hi, I don’t understand how expansion and projection convolutions work and how they help.

According to Exercise 3 in Assignment 2 of Week 2:
"Each block consists of an inverted residual structure with a bottleneck at each end. These bottlenecks encode the intermediate inputs and outputs in a low dimensional space, and prevent non-linearities from destroying important information."

Regarding the part highlighted in bold:
Question 1: What does "inverted residual structure" mean, and how does it help?
Question 2: What does "bottleneck" mean, and how does it help?
Question 3: How do non-linearities destroy information, and how does encoding in a low-dimensional space help prevent this?

Question 4: Does every channel of the convolution’s output represent a particular feature, or does each individual value in a feature map represent a feature?
@paulinpaloalto @ai_curious, can you please provide some insight, sir?


I’ll defer to @paulinpaloalto and his esteemed mentor colleagues to take a swing at this one. Meanwhile, if you want to understand why the answer is sometimes ‘YES’ to the question of whether you need to get a PhD, take a slog through the MobileNet paper ==> https://arxiv.org/pdf/1801.04381.pdf

Now I’m going to go take something for this headache!


I can make a few comments, but I should start with the disclaimer that the MobileNet material is all brand new in the “refresh” version of the courses released in April 2021, so I haven’t lived with it for very long yet. I just went through it quickly back in April to be able to answer questions on the assignment, which doesn’t really deal with the actual details of the MobileNet model and treats it as a black box. So I don’t really know anything about the details of MobileNet itself. Here are a few things I can think of that are relevant to your questions:

  1. The comment comparing the architecture to residual nets doesn’t really seem to help in understanding how things work here. If you compare the two architectures at the level of the diagrams, the skip connections seem to be the main point of comparison. In the MobileNet case, they skip between the (apparently linear) bottleneck layers. In a residual net, you have no bottleneck layers and the skip connections can either preserve or convolve the input to the output dimensions, skipping over several pretty plain vanilla conv layers. So they are different, and I don’t see how one is the “inverse” of the other. Doesn’t seem like that comparison is very helpful, but that probably just means I’m missing the point. (The block sketch after this list shows where the skip connection lands in the MobileNet case.)

  2. A bottleneck is just standard English usage that refers to some sort of pathway that shrinks down in size, e.g. you’re on a 4-lane road and it necks down to 3 lanes and then 2 lanes. That’s a “bottleneck”. Here they just mean that they are reducing the dimensionality from input to output. The interesting bit is that each full composite layer expands and then contracts: intuitively they are taking a smaller input, doing something more complex to analyze it, and then distilling it back to a lower dimension after just two convolution layers. Of course the “magic” here is that there are coefficients at every single layer and those are learned through back prop. So it either learns something interesting and useful or it doesn’t. They seem to say that (if properly configured) it can learn something useful with this architecture. (See the block sketch after this list for what the expand/contract structure looks like.)

  3. The statement that non-linear layers lose information just seems ridiculous on its face from a mathematical standpoint. Maybe some non-linear layers do, but not an arbitrary non-linear function. So (with some trepidation) I read the abstract of the paper and then skimmed just the section on the linear layers, and I think I see what they are talking about. All they discuss in that section is ReLU. So, yes, of course, ReLU destroys information: it’s a high-pass filter. Everything above 0 is passed through as is, and everything less than zero disappears and has no effect on the output. So be careful which non-linear function you pick as your activation. If you used tanh, sigmoid or swish, you would not have this problem. But then the whole point of MobileNet is to be cheap to run in prediction mode in terms of both compute and memory, so maybe Leaky ReLU is really the only possible choice that avoids destroying information while still providing non-linearity at minimal compute cost. They don’t mention that, though. But this is Science, right? They ran the experiments and proved that it works better with just pure linearity in the bottleneck layers. That’s the way Science works, so let’s go with that. Mind you, I spent a total of < 5 minutes with the paper, so I can’t claim to actually understand any of it. (There’s a tiny numeric illustration of the ReLU point after this list.) :nerd_face:

  4. That’s a pretty general question, not specific to MobileNet. The way general convnets work, I think the way to look at the output of each conv layer is that each filter is distilling some kind of information from the geometry and values of the pixels in the input and passing that down, perhaps at a reduced dimension. So I think it’s your first statement: you can think of each channel as containing distilled information about some particular feature of the input. But there are lots of ways to use ConvNets and they don’t all just result in a classification output (e.g. stay tuned for U-Net and YOLO in upcoming weeks). (There’s a small shape example after this list.)
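
Here is roughly what one of those blocks looks like in Keras terms. To be clear, this is my own sketch based on the paper’s description, not the actual assignment code; the expansion factor of 6, the ReLU6 activations, and the channel sizes are illustrative assumptions.

```python
# Rough sketch of one MobileNetV2-style "inverted residual" block.
# Expansion factor, ReLU6, and channel counts are illustrative assumptions
# taken from the paper, not from the assignment code.
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual_block(x, in_channels=16, out_channels=16, expansion=6, stride=1):
    shortcut = x

    # 1) Expansion: a 1x1 ("pointwise") conv grows the channel count, followed by ReLU6.
    h = layers.Conv2D(in_channels * expansion, kernel_size=1, padding='same', use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)

    # 2) Depthwise conv: one 3x3 filter per channel (cheap spatial filtering), ReLU6 again.
    h = layers.DepthwiseConv2D(kernel_size=3, strides=stride, padding='same', use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)

    # 3) Projection: a 1x1 conv back down to the low-dimensional bottleneck,
    #    with NO activation afterwards (the "linear bottleneck").
    h = layers.Conv2D(out_channels, kernel_size=1, padding='same', use_bias=False)(h)
    h = layers.BatchNormalization()(h)

    # The skip connection joins the two narrow bottleneck ends; this is the
    # "inverted" part, since residual nets skip between the wide layers instead.
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([shortcut, h])
    return h

inputs = tf.keras.Input(shape=(32, 32, 16))
outputs = inverted_residual_block(inputs)
tf.keras.Model(inputs, outputs).summary()
```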
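
And a tiny numeric illustration of the ReLU point in item 3, using nothing beyond NumPy:

```python
# ReLU discards information: every negative input maps to 0, so distinct
# inputs become indistinguishable after the activation. tanh does not collapse them.
import numpy as np

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(np.maximum(x, 0.0))  # ReLU: [0.  0.  0.  0.5 3. ]  (the first three collapse to 0)
print(np.tanh(x))          # tanh: all five outputs remain distinct (though it saturates)
```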
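
Finally, a quick shape check for item 4: each filter in a conv layer produces one output channel, so the number of filters is the number of feature maps.

```python
# Each of the 8 filters produces one output channel ("feature map").
import tensorflow as tf

x = tf.random.normal((1, 64, 64, 3))                             # one 64x64 RGB image
y = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding='same')(x)
print(y.shape)   # (1, 64, 64, 8): channel k is the response map of filter k
```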

So maybe more words than are justified by the amount of information in the above, but that’s what I’ve got this morning. :grin:


Thank you so much for taking the time and effort to write such a long and helpful response!! That part about ReLU losing information but being cheaper because it acts as a high-pass filter is very interesting! I still have a few lingering doubts.

  1. On question 2 (bottlenecks): so the expansion conv helps extract a larger number of features (channels) from the channels of the previous layer, then the padded depth-wise convolution combines them into bigger features within each channel, and then these bigger features are reduced again to a lower-dimensional space, which helps with what, exactly?

Also, are non-linearities applied after each of the convolutions in this bottleneck?

  2. On question 3: how does encoding in a low-dimensional space help prevent these non-linearities from destroying important information?

As I said in my previous disclaimer, I have not personally looked at the internal details of how the MobileNet modules are actually constructed. So I can only talk in general terms based on the diagrams and the words Prof Ng said.

I think the point about the bottlenecks is all about keeping the costs down, both in terms of compute and memory. The interesting point is that you can train the network to extract the information that matters and still reduce the dimensionality. As I pointed out in my previous comment, the key point is that all this behavior is learned. You specify the sizes of the various filters and then the training does the rest. There is no a priori reason to believe that any of this is guaranteed to work. What they have demonstrated is that it does, if you construct the network appropriately.
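
To make the cost point concrete, here is a back-of-the-envelope parameter count. The channel sizes and expansion factor are made-up numbers in the spirit of the paper, not values from the assignment:

```python
# Back-of-the-envelope weight counts (ignoring bias and batch-norm parameters),
# with made-up channel sizes, just to illustrate the cost argument.
k = 3               # spatial kernel size
wide = 144          # channel width "inside" the block (after expansion)
narrow = 24         # bottleneck width stored between blocks

# A plain 3x3 conv that keeps everything at the wide width:
standard = k * k * wide * wide

# Expansion (1x1) -> depthwise (3x3) -> projection (1x1) around a narrow bottleneck:
expansion  = narrow * wide          # 1x1 pointwise, narrow -> wide
depthwise  = k * k * wide           # one 3x3 filter per channel
projection = wide * narrow          # 1x1 pointwise, wide -> narrow
bottleneck = expansion + depthwise + projection

print(standard, bottleneck, round(standard / bottleneck, 1))
# -> 186624 8208 22.7 : similar "width" inside the block for a fraction of the weights,
#    and only the 24-channel tensors need to be kept in memory between blocks.
```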

There is some other very interesting work about pruning networks, but that is beyond the scope of these courses and I have not had time to look into it myself yet. There’s a paper someone linked in another Discourse post that sounds totally interesting along the above lines. But as with the MobileNet paper, there is no guarantee that it will be easily accessible unless one has the appropriate math and DL background.


I think the comment is misleading in the way it conflates the lowering of the dimensionality with the prevention of the loss of information from non-linearity.

As I mentioned above, the way they eliminate the loss of information from ReLU is that they eliminate ReLU: they use no activation functions in the bottleneck layers at all. At least that’s what I take from the verbiage, without actually having looked at the structure of the model.
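
In Keras terms, that would look something like the snippet below; again, this is an assumption based on the paper’s description rather than the actual assignment code:

```python
# A "linear bottleneck" projection: a plain 1x1 conv with no activation after it,
# so nothing is clipped away by ReLU at the narrow end of the block.
from tensorflow.keras import layers

def project_to_bottleneck(h, out_channels):
    h = layers.Conv2D(out_channels, kernel_size=1, padding='same',
                      use_bias=False, activation=None)(h)  # linear by design
    return layers.BatchNormalization()(h)
```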

When you reduce dimensions, unless you expand the number of output channels at the same time, you most likely are ending up with fewer output variables. So you’ve “lost information” in some generic sense, but the whole point is that the network can learn which pieces of information are important and which are not. That’s what the training is doing. If you’re looking for a cat in a picture and start with 12288 pixel values and end up with a single bit answer, you’ve “lost information”, but (one hopes) the answer is correct and that’s what you were after.
