MaxPooling2D - layer size and related formulae

Hi, a few simple questions here.

I understand that max pooling serves to reduce the overall image data size once it is applied. But here in this non-graded example, the layer sizes keep increasing, so I feel like I am missing something.

Also, the first FC layer uses 32 * 8 * 8 as its input size. I'm not sure I am catching the formula that produces this number, and I suspect it also has something to do with MaxPool2d.


def __init__(self):
    """Initializes the layers of the neural network."""
    # Initialize the parent nn.Module class
    super(SimpleCNN, self).__init__()
    # First convolutional layer (3 input channels, 16 output channels, 3x3 kernel)
    self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
    # Second convolutional layer (16 input channels, 32 output channels, 3x3 kernel)
    self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
    # Max pooling layer with a 2x2 window and stride of 2
    self.pool = nn.MaxPool2d(2, 2)
    # First fully connected (linear) layer
    self.fc1 = nn.Linear(32 * 8 * 8, 64)
    # Second fully connected (linear) layer, serving as the output layer
    self.fc2 = nn.Linear(64, 10)
    # Dropout layer for regularization
    self.dropout = nn.Dropout(p=0.4)

MaxPool2d(2, 2) will reduce both the height and width by a factor of 2 and preserve the channel dimension. But we can’t see the actual input image size from the code you show, and when you make the transition from the ConvNet section to the final FC section, the image size matters, right? All we can see is how the number of channels changes, but when you do the “flatten” (which doesn’t show up there, BTW. Does torch nn.Linear do an “auto-flatten”?), you need to know h and w as well as c in order to compute the resulting vector size.
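
To make the shape arithmetic concrete, here's a quick sketch (with a made-up 16-channel, 16 x 16 dummy input, purely for illustration):

import torch
import torch.nn as nn

# Dummy batch: 1 image, 16 channels, 16 x 16 spatial size (made-up numbers)
x = torch.randn(1, 16, 16, 16)

pool = nn.MaxPool2d(2, 2)
y = pool(x)
print(y.shape)              # torch.Size([1, 16, 8, 8]) -- channels preserved, h and w halved

# Whatever does the flattening, the resulting vector length depends on c, h and w:
print(y.view(1, -1).shape)  # torch.Size([1, 1024]), i.e. 16 * 8 * 8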

Which course and section is this from?


Hi, Anthony.

Now that I think about it \epsilon more, we can “back-calculate” the image dimensions by using the same technique as for computing output dimensions on transpose convolutions.

At the FC layer, the input size is 32 * 8 * 8 and we can see that there are 32 input channels, so the feature maps at that point are 8 x 8. So before the MaxPool layer, they were 16 x 16.

If we apply the transpose convolution formula (see this thread, which links to this one):

n_{out} = (n_{in} - 1) * s + f - 2p

With n_{in} = 16, f = 3, s = 1, and p = 1, we get:

n_{out} = (16 - 1) * 1 + 3 - 2*1 = 16

Then it’s the same in the first conv layer, so it looks like the original inputs must have been 16 x 16. Does that make sense based on the context?

We can check our work by using the “forward” convolution formula:

n_{out} = \displaystyle \lfloor \frac {n_{in} + 2p - f}{s} \rfloor + 1

Which gives:

n_{out} = \displaystyle \lfloor \frac {16 + 2 * 1 - 3}{1} \rfloor + 1 = 16
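
Here's a quick numeric sanity check in code (just a sketch, with an arbitrary 16-channel dummy input) that a 3 x 3 convolution with stride 1 and padding 1 really does preserve the spatial size:

import torch
import torch.nn as nn

def conv_out(n_in, f, p, s):
    # Forward convolution size formula: floor((n_in + 2p - f) / s) + 1
    return (n_in + 2 * p - f) // s + 1

print(conv_out(16, f=3, p=1, s=1))   # 16 -- a "same" convolution

# Cross-check against an actual layer (the channel counts are arbitrary here)
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
x = torch.randn(1, 16, 16, 16)
print(conv(x).shape)                 # torch.Size([1, 32, 16, 16])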

Dear Paul,

Sorry it took me a minute to realize where I got this from. It is actually from the ungraded assignment on Hyperparameter tuning in Module 1 of Course 2 (I’ve updated the references). The images are CIFAR-10, so in their original state they are 32 x 32.

I also originally omitted the ‘forward’ section of the model, thinking it’d be too long, but this is where the layers are actually applied; it is included down below. And yes, you do need either nn.Flatten() or, in this case, a call to .view() in PyTorch to do the flattening before an FC layer.

I also now see where the ‘32 * 8 * 8’ comes from: we have 32 filters, and the resulting image size has been cut in half twice (32 -> 16 -> 8) due to MaxPool2d.

When I heard about this newly released course, I figured it was about time I learned PyTorch and learned it well, so I am back here again.

Hope all is well.

Best,

-A

def forward(self, x):
    """Defines the forward pass of the network.

    Args:
        x (torch.Tensor): The input tensor of shape (batch_size, 3, height, width).

    Returns:
        torch.Tensor: The output logits from the network.
    """
    # Apply first convolution, ReLU activation, and max pooling
    x = self.pool(F.relu(self.conv1(x)))
    # Apply second convolution, ReLU activation, and max pooling
    x = self.pool(F.relu(self.conv2(x)))
    # Flatten the feature maps for the fully connected layers
    x = x.view(-1, 32 * 8 * 8)
    # Apply the first fully connected layer with ReLU activation
    x = F.relu(self.fc1(x))
    # Apply dropout for regularization
    x = self.dropout(x)
    # Apply the final output layer
    x = self.fc2(x)
    return x
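
For completeness, here is a quick shape trace (just a sketch with a single dummy 3 x 32 x 32 CIFAR-10 sized input) that confirms where the 32 * 8 * 8 comes from:

import torch
import torch.nn as nn
import torch.nn.functional as F

pool = nn.MaxPool2d(2, 2)
conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)        # one CIFAR-10 sized image
x = pool(F.relu(conv1(x)))           # -> (1, 16, 16, 16)
x = pool(F.relu(conv2(x)))           # -> (1, 32, 8, 8)
print(x.shape)
print(x.view(-1, 32 * 8 * 8).shape)  # -> (1, 2048), the input size of fc1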

Hi, Anthony.

Ah, ok, I am a bit rusty on my PyTorch, so I missed the fact that the original code you showed just defined the layers but didn’t show how they were actually applied. (I’ve started taking the new course as well, but I’m only up to C1 W3 at this point.) Now that we see the forward method, the pooling is applied after each conv layer, so the original images are 32 x 32, as you say.

My only previous exposure to torch was taking the GANs specialization, and I remember thinking how much cleaner torch is than TF, although when I originally took GANs in 2019 they had not yet upgraded DLS to TF v2 with Eager Mode. Torch is morally equivalent to TF Eager Mode, but it is cleaner because they started out that way, without all the graph business from TF v1.

Cheers,
Paul