Why the convolution operation?

In a normal neural network (perceptron):
Suppose I am given a vector x and I want to predict y with a perceptron. Here, our basic assumption is y = W^{t}*x, i.e., we assume y is a linear combination of the components of x, which is a natural assumption.

In a convolutional neural network:
For images we want something more than a perceptron: a model that not only predicts y as some linear combination of the pixel values (X) but also treats neighboring pixels together. For that purpose we use the convolution operator.

So, my question is: why does the convolution operation achieve this goal? I would like a detailed explanation of the mathematical intuition behind it. Why not some other operation? Is there any proof that convolution is the best choice? I have the same questions for pooling.

Hi, @Soumitra_Das !

To answer that, I’ll highlight the main differences between both approaches.

  • Convolutions are not densely connected, so not every input node affects every output node. This gives convolutional layers more flexibility in learning. In addition, fully connected layers become dependent on the shape of the training images, which might not be good for the overall model.
  • Moreover, the number of weights per layer is much smaller in a CNN, which helps a lot with high-dimensional inputs such as image data. These advantages are what give CNNs their well-known ability to learn features in the data, such as shapes and textures in images. FC layers have a larger number of weights, which makes them highly prone to overfitting, whereas a convolutional layer reduces the number of parameters significantly and is therefore less prone to overfitting.
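
To put rough numbers on the second point, here is a minimal sketch that counts parameters for one fully connected layer and one convolutional layer. The input size (224 x 224 x 3), the 64 outputs, and the 3 x 3 kernel are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def dense_params(in_h, in_w, in_c, out_units):
    """Weights plus biases for a fully connected layer on a flattened image."""
    return in_h * in_w * in_c * out_units + out_units


def conv_params(k_h, k_w, in_c, out_c):
    """Weights plus biases for a conv layer; note there is no image height/width here."""
    return k_h * k_w * in_c * out_c + out_c


# Assumed sizes, for illustration only.
fc = dense_params(224, 224, 3, 64)   # every pixel connects to every output unit
conv = conv_params(3, 3, 3, 64)      # one small kernel per output channel, shared across positions

print(fc, conv)
```

The convolutional count does not depend on the image size at all, which is also why the layer is not tied to the shape of the training images.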

You are describing the benefits of a convolutional layer over a fully connected layer, which are mainly:

  1. Each cell of the next layer depends only on a small portion of the previous layer.
  2. The number of parameters is much smaller.

My main question, though, was why we compute a cell's value as the sum of each cell in the previous layer multiplied by its corresponding weight, as we do with W^{t} * x in a perceptron.
By the way, it’s clear to me now: it is just a replica of W^{t} * x.
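
For other readers, the "replica of W^{t} * x" idea can be sketched in a few lines of plain Python. This is only an illustration: the function name and the toy image and kernel are made up, and the operation is the unflipped sliding dot product (cross-correlation) that deep learning libraries usually call convolution.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output cell is w . x for one patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    # Flatten the kernel once: this is the shared weight vector w.
    w = [v for row in kernel for v in row]
    out = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            # Flatten the patch under the kernel: this is the local input vector x.
            x = [image[i + a][j + b] for a in range(kh) for b in range(kw)]
            # Same computation as a perceptron's W^T x, restricted to the patch.
            out_row.append(sum(wi * xi for wi, xi in zip(w, x)))
        out.append(out_row)
    return out


# Toy data, assumed for illustration.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

result = conv2d_valid(image, kernel)
print(result)  # [[-4, -4], [-4, -4]]
```

The same weight vector w is reused at every position, so the layer is a perceptron applied to each small neighborhood rather than to the whole image.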

In addition, for other learners who read this and want more information on convolution:

You might want to take a look at this thread: How to Calculate the Convolution?

Best regards

Hi, @alvaroramajo,
Can you please tell me why we use max/average pooling? How does it help? Of course it reduces the number of parameters, but what is the main reason for introducing this idea?
Can you suggest some papers or books where I can see the mathematical details behind this?

That’s basically the whole idea of pooling. Convolving the inputs all the way to the final layers becomes computationally prohibitive for certain input sizes, so reducing the dimensionality is a necessity.
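
As a rough sketch of that dimensionality reduction, the snippet below applies non-overlapping 2 x 2 max and average pooling to a small feature map. The function names, the window size, and the toy values are assumptions for illustration, not taken from any particular library.

```python
def pool2d(fmap, size=2, reducer=max):
    """Non-overlapping pooling: summarize each size x size window with one value."""
    out = []
    for i in range(0, len(fmap) - size + 1, size):
        out_row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[i + a][j + b] for a in range(size) for b in range(size)]
            out_row.append(reducer(window))
        out.append(out_row)
    return out


def mean(values):
    """Average of a window, used for average pooling."""
    return sum(values) / len(values)


# Toy 4 x 4 feature map, assumed for illustration.
fmap = [[1, 3, 2, 0],
        [4, 6, 1, 1],
        [0, 2, 9, 5],
        [1, 1, 7, 3]]

max_pooled = pool2d(fmap, size=2, reducer=max)
avg_pooled = pool2d(fmap, size=2, reducer=mean)
print(max_pooled)  # [[6, 2], [2, 9]]
print(avg_pooled)  # [[3.5, 1.0], [1.0, 6.0]]
```

Each 2 x 2 pooling step turns 16 values into 4, so every later layer has a quarter as many positions to process.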

This paper (2011) may be one of the first to address this type of solution, while this one (2020) discusses and compares several methods.