How do layers "just" learn different features on their own?

This has been bugging me for a while. Andrew is a great instructor, and he mentioned how, in deep learning, different layers learn to detect different features.


Multiple times, it is mentioned how, let's say, layer #1 will detect edges, layer #2 will learn to detect shapes, and layer #3 will learn to detect organs.
I don't understand the intuition here.


Let me elaborate. Take end-to-end deep learning, which can be used when we have a lot of training data.

So, with end-to-end deep learning, I am assuming that somehow (A: that "somehow" is still unclear to me) it will learn what it needs to learn based on the Y that we ask it to predict.


I hope you can help me understand the intuition behind (A) and (B). In the end, all we are doing is finding the proper weights and biases…

Thank you!

Welcome to the community!

For your question (A), there are actually two cases. One is traditional computer vision, which applies different types of "fixed" filters to detect objects. The other is the deep learning approach, which uses a convolutional network that we will learn about in Course 4 of this specialization. I think the second one is more intuitive and also a good way to understand deep learning.

Here is Andrew's picture of a convolutional neural network.

In short, we use different types of filters to extract features from an image. As you can see, the size (height and width) gets smaller, while the depth (the number of channels) gets larger.
Now think about applying, say, a 5x5 filter. If this filter is applied to a 224x224x3 image, it can only focus on a very small portion of the image, like the edges in a small area. But if the same size of filter is applied to a 26x26x256 volume, it covers a larger area of the original image, i.e., it can detect shapes.
And so on. This paper shows visualizations of what those layers actually focus on, which will help you understand what's happening in each layer.
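To make the "same filter sees a bigger area in deeper layers" idea concrete, here is a small sketch using the standard receptive-field recursion. The layer configuration below is my own toy example (not from Andrew's slide): two stride-2 downsampling layers in front of the 5x5 filter.

```python
# Sketch: how many input pixels a filter "sees", depending on how much
# downsampling happened before it. Layer configs here are illustrative.

def receptive_field(layers):
    """layers: list of (kernel_size, stride) from input to output.
    Returns the receptive field of one output unit, in input pixels."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# A 5x5 filter applied directly to the input sees only a 5x5 patch (edges).
print(receptive_field([(5, 1)]))                      # 5

# The same 5x5 filter applied after two stride-2 layers sees a 23x23 patch
# of the original image, so it can respond to whole shapes.
print(receptive_field([(3, 2), (3, 2), (5, 1)]))      # 23
```

So the filter itself stays small; what grows is the region of the original image that each deep-layer unit summarizes.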

For your 2nd question (B), we should start with Andrew's example.

The traditional approach (the upper figure) consists of multiple steps. Each step is an independent task and can be optimized as such. So there are multiple human interventions: a human fixes the goal of each step, i.e., the intermediate feature representation. In end-to-end learning, on the other hand, there is no human-designed intermediate representation. Each step in the traditional pipeline has some weights, but since it only focuses on a small task, the number of weights is not that large.
In end-to-end learning, however, one network covers the combination of all of those tasks at once, so the number of weight combinations becomes huge.
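This is also where the "somehow" of question (A) lives: in end-to-end training, only the final output is compared against Y, yet gradient descent pushes that error signal backwards through every layer, so the early layers end up learning whatever intermediate features reduce the final loss, with no human telling them what to be. A minimal toy sketch (my own made-up data and layer sizes, just two dense layers instead of a real conv net):

```python
# Toy end-to-end training sketch: no one defines what the hidden layer
# should compute; both weight matrices are updated from the same Y signal.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)  # toy target

W1 = rng.normal(scale=0.5, size=(4, 8))   # layer 1: learns features on its own
W2 = rng.normal(scale=0.5, size=(8, 1))   # layer 2: maps features to Y

for _ in range(500):
    H = np.tanh(X @ W1)                   # hidden features (never supervised)
    P = 1 / (1 + np.exp(-(H @ W2)))       # prediction
    dZ2 = P - Y                           # cross-entropy gradient at the output
    dW2 = H.T @ dZ2 / len(X)
    dH = dZ2 @ W2.T * (1 - H**2)          # error flows back through tanh
    dW1 = X.T @ dH / len(X)               # ...and reaches layer 1's weights
    W1 -= 0.5 * dW1
    W2 -= 0.5 * dW2

print(((P > 0.5) == Y).mean())            # training accuracy after 500 steps
```

The target here depends on an interaction between two inputs, so the hidden layer has to invent some feature encoding that interaction; all it ever receives is the gradient of the final loss.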

This type of large (deep) neural network is hard to train to convergence, since there are too many combinations of weights (just like fitting the coefficients of a multivariate polynomial).
To make this deep network converge, we need more data and more iterations, which require computational power. (In that sense, one reason end-to-end deep learning is getting so much attention is the evolution of computational power, like GPUs.)

Hope this helps some.
