How does a CNN know what features to extract from images?

Hello guys… I always wonder how exactly a CNN extracts important features from images. If we input an image of a face, it extracts features like eyes, ears, nose, mouth, etc. But how does the CNN know these are important features? How can it choose the eyes and nose as important when, technically, it doesn't even know they are called eyes and nose? I know the filters learn everything from detecting lines up to complex shapes, but I want to know how it chooses the nose and eyes as important features, but not the chin or cheeks…?

Hi Adithya70,

Maybe this paper explains it.


In general, the short story is that someone took a big set of images and created labeled points for all of the characteristics - essentially, they clicked a mouse on the specific points of interest.
That data was recorded and fed into a neural network. Given the images (inputs) and the identified points (output labels) as the training set, the NN learned how to generate those points.
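To make that concrete, here is a minimal sketch of the same supervised setup, with everything made up for the example: tiny synthetic 8x8 "images" whose landmark is a single bright pixel, and a plain linear model standing in for a real CNN. The point is only that, given inputs and clicked-point labels, gradient descent recovers the mapping on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the labeled-landmark setup: each 8x8 "image" contains a
# single bright pixel, and the label is that pixel's (row, col) coordinates,
# as if a human had clicked on the point of interest.
n, size = 200, 8
coords = rng.integers(0, size, size=(n, 2))
images = np.zeros((n, size, size))
images[np.arange(n), coords[:, 0], coords[:, 1]] = 1.0

X = images.reshape(n, -1)        # flattened pixels: the inputs
Y = coords.astype(float)         # the clicked points: the output labels

# A plain linear regressor (no convolutions), trained by gradient descent
# on mean squared error, learns to map pixels to point locations.
W = np.zeros((size * size, 2))
for _ in range(2000):
    W -= 10.0 * X.T @ (X @ W - Y) / n

err = np.abs(X @ W - Y).mean()
print(f"mean landmark error (pixels): {err:.3f}")
```

A real landmark network uses convolutional layers and far more data, but the training loop has the same shape: predict points, measure the error against the human-labeled points, and adjust the weights.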


I think it is challenging to answer a question like this because the field is so broad, and short forum answers may not universally apply. I feel like the previous replies both tilt toward landmark detection because the OP specifically used parts of the human face in the question. But generally, the answer is that the power of neural networks is that you don't have to tell the algorithm what the important features are. Rather, you provide a collection of labelled outcomes, a cost function, and an activation function, and you let the network learn which features enable it to meet its mathematical constraints and objectives.

Sometimes it's not intuitive or obvious exactly what the network thinks is important. Is it sharp edges? Bilateral symmetry? Aspect ratio? Color? Context? All of the above? There isn't enough time left in the life of our sun to fully specify all the important features of all the images neural nets have already learned to handle. That's part of why they are so interesting/terrifying.
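A toy illustration of that point, with everything in it invented for the sketch: a one-neuron "network" is given 100 input features, a cross-entropy cost, and a sigmoid activation, but it is never told that only two of the features actually determine the label. Training discovers which ones matter on its own.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical task: 100 input "features" per sample, but the label secretly
# depends only on features 3 and 7. The model is never told which ones matter.
n, d = 1000, 100
X = rng.normal(size=(n, d))
y = (X[:, 3] + X[:, 7] > 0).astype(float)

# Logistic regression (a one-neuron network with a sigmoid activation and a
# cross-entropy cost) trained by gradient descent.
w = np.zeros(d)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.5 * X.T @ (p - y) / n

# The learned weights reveal which inputs the model decided were important.
important = sorted(np.argsort(np.abs(w))[-2:].tolist())
print("features the model found important:", important)
```

Nothing in the code names features 3 and 7 as special; they simply end up with the largest weights because that is what minimizes the cost. A CNN's filters are the same story at a much larger scale.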

Specifically, to classify an object in an image as human, you don't provide the locations of the eyes, arms, legs, teeth, etc. You just say 'this blob of pixels is a human.' And so is this one. And this one. But these other blobs of pixels are not. You never explicitly enumerate how to differentiate a human pixel blob from a non-human one. Rather, the neural net figures it out on its own, extracting whatever features it needs to.
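Here is a small sketch of that idea, with synthetic data standing in for real photos: "human" images are just noise plus a bright blob at a random spot, "non-human" images are pure noise, and the only supervision is the 0/1 label. A simple linear classifier (not a real CNN) still learns to separate them without ever being told what a blob looks like.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_image(has_blob):
    # 8x8 noise image; "positive" images also get a bright 3x3 blob at a
    # random location. We never describe the blob to the model.
    img = rng.normal(0.0, 0.3, size=(8, 8))
    if has_blob:
        r, c = rng.integers(0, 6, size=2)
        img[r:r + 3, c:c + 3] += 1.0
    return img.ravel()

X = np.stack([make_image(i % 2 == 0) for i in range(400)])
y = (np.arange(400) % 2 == 0).astype(float)

# Train from labels alone: 'this blob of pixels is a human' (1) or not (0).
# No feature locations or descriptions are ever provided.
w, b = np.zeros(64), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * X.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == (y == 1))
print(f"training accuracy: {acc:.2f}")
```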


Yeah… and you put that really well. It's very interesting and yet terrifying.
