General Qn on pooling

Hi guys

I understand the math behind pooling and how it extracts certain values based on the pooling operation. But I have trouble understanding how the network knows where the features are in an image based on the value that is extracted by a pooling layer. Can anyone shed some light on the inner workings? So far I have seen a few different schools of thought, and I would like to hear the perspective of the community.

Hey @zheng_xiang1,
Let me see if I get your question right. Let’s say we have some input as follows:

A = [[1, 2], [3, 4]]

Now, we apply a max-pooling operation to this input, so the output will be as follows:

A_max_pool = [4]

Now, if your question is: given this output value 4, how do we know that the feature was situated in the second row, second column? If so, then the answer is straightforward. When back-propagation happens through a max-pooling layer, a mask is created that stores the position of the max value, so the gradient is routed back to that location. You will learn more about this in the C4 W1 A1 assignment. And if your question is something else, then please do let us know.
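As a hedged illustration of that mask (my own NumPy sketch, not the assignment's implementation), assuming the 2x2 input above:

```python
import numpy as np

def max_pool_with_mask(A):
    """Max-pool a single patch and record where the max came from.

    The mask is what back-propagation uses to route the gradient
    back to the winning position. (Illustrative sketch only.)
    """
    max_val = np.max(A)
    mask = (A == max_val)  # True only at the position of the max
    return max_val, mask

A = np.array([[1, 2], [3, 4]])
val, mask = max_pool_with_mask(A)
print(val)   # 4
print(mask)  # True only at row 2, column 2 -> that is where the feature was
```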


Hi, thanks for the reply, but my question is more about how it actually extracts features from an image, i.e., how the pooling makes sense of the image features after a few layers of convolutions. So it's not so much about the code; in your example, what does 4 mean to the network, and why is 4 more important than 1, 2, or 3?

Hi @zheng_xiang1 ,

This is my attempt to explain this.

Let's remember that each convolutional stage in a CNN actually involves 3 transformations:

  1. A filter is applied
  2. An activation is applied
  3. A pooling (like MaxPooling) is applied.

Transformations 2 and 3 are pretty simple math (one is a non-linear function applied to each neuron, and the other is simply a pooling of some sort, like max pooling picking the max value of a patch). The secret of the feature extraction actually lies in Transformation 1: the applied filter.
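A minimal NumPy sketch of those three transformations on a toy input (the filter values here are hand-picked for illustration; as discussed below, a real CNN learns its filters):

```python
import numpy as np

def conv2d_valid(X, W):
    """Naive 'valid' cross-correlation: slide the filter W over X by patches."""
    h, w = W.shape
    out = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + h, j:j + w] * W)
    return out

X = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
W = np.array([[-1., 0., 1.],                  # hand-picked vertical-edge
              [-1., 0., 1.],                  # filter, only for illustration
              [-1., 0., 1.]])

Z = conv2d_valid(X, W)  # 1. filter applied          -> 2x2 feature map
A = np.maximum(Z, 0.0)  # 2. ReLU activation applied -> 2x2
P = A.max()             # 3. 2x2 max pooling applied -> a single value
print(Z, P)
```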

Transformation 1: The filter is applied by patches. Here is where the magic happens.

In class we see how Prof Ng writes down some pre-determined filters, usually 3x3 in his examples. But in "real life" we do not define the filters explicitly. The only definitions we make as system designers are the size of the filter, the stride, and the number of filters. For instance: a 3x3 filter, stride 2, and 1 filter (or 2, or 3, or more filters). These filters start with random numbers. I am not sure, but maybe they can also start at zero as an option.

When the training starts, we will see the usual FwdProp and BwdProp. Here is where this 'magic' starts to happen and the network starts to learn the filters. The filters end up being whatever the network learned: as we go through the forward and backward propagations, on each pass, when the CNN attempts to close the gap between prediction and truth, the filters' values are updated.
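That update loop can be caricatured with a single scalar "filter" (all numbers here are invented for illustration; this is not the course code, just the shape of the idea):

```python
# Minimal caricature: the "filter" is one scalar weight, and each
# fwd/bwd pass nudges it toward closing the gap between prediction
# and truth. (Illustrative values, not from the course.)
w = 0.0            # filter starts at zero (or random, in practice)
x, y = 2.0, 6.0    # input and ground truth
lr = 0.1           # learning rate

for _ in range(50):
    pred = w * x                # forward pass through the "filter"
    grad = 2 * (pred - y) * x   # backward pass: dLoss/dw for squared error
    w -= lr * grad              # update: the filter is learned, not hand-set

print(round(w, 3))  # converges near 3.0, since 3.0 * 2.0 == 6.0
```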

At the end, which filters are configured to pick up which features? This, in general, is a bit of a mystery, and some studies have been done, and continue to be done, to better understand it. In fact, Prof Ng shows some depictions of the results of these filters, but there is still some understanding to reach.

So again, the features are extracted with the filters. The filters are learned by the CNN while training.

Yes, I know. Still the same question: how does it do it? I guess this is as far as I can explain it :slight_smile:




Very insightful, but if I may ask a bit more about the explanation: features being extracted by the filters is understandable, but then why include a max pool, which does something similar? Not saying that it extracts information like the main features, but hypothetically you could just have convolutions with more filters instead of pooling layers, right? In the video Andrew explained that "no one really knows why pooling works", which I find puzzling.

The pooling in CNN has several functions:

  1. Reduction of dimensions. This one is easy to see. You can reduce the input of the pool by several factors when applying the pooling function. This allows you to work efficiently with very large initial inputs (like very large images).

  2. Highlighting the hierarchy and relevance of features. Although I mentioned that filters do most of the feature extraction, pooling also has an effect on features: it 'condenses' them, keeping the most salient values while performing the reduction of dimensions.

  3. Speed and memory efficiency. By reducing the size of the information being processed, the pooling helps with speed and computational efficiency.
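Point 1 can be sketched numerically; assuming a plain 2x2, stride-2 max pool, each application quarters the number of values:

```python
import numpy as np

def max_pool_2x2(X):
    """2x2 max pooling with stride 2 on a 2-D array (even dims assumed)."""
    h, w = X.shape
    return X.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
X = rng.random((64, 64))         # stand-in for a large feature map
P1 = max_pool_2x2(X)             # 64x64 -> 32x32: 4x fewer values
P2 = max_pool_2x2(P1)            # 32x32 -> 16x16: 16x fewer than X
print(X.size, P1.size, P2.size)  # 4096 1024 256
```

Note that the most salient value (the global max) survives both reductions, which ties into point 2.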

I am not sure how it would work if you didn't use pooling in the CNN... maybe you can achieve something, and it would be a good idea to experiment. But I can imagine that Mr Yann LeCun, who came up with the CNN as we know it, went through those experiments and found that adding pooling was a better solution.

Absolutely - I agree. It is puzzling.


Truly interesting topic

Hello @zheng_xiang1! This is an interesting thread.

I think that, generally, the most prominent feature in a region of the image has the maximum value. That is how max pooling extracts prominent features from an image.

However, there may be some exceptions (noise or distortion). In those cases, average pooling can perform better than max pooling.
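A toy illustration of that exception (the numbers are invented): a single noise spike dominates max pooling, while average pooling dilutes it.

```python
import numpy as np

patch = np.array([[0.1, 0.2],
                  [0.1, 9.0]])  # 9.0 is a noise spike, not a real feature

print(patch.max())   # max pool reports the outlier itself
print(patch.mean())  # average pool softens the spike's influence
```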

This is just my understanding. Open to correction...

