Clarifying my understanding of NNs: we don't actually specify which parts of an image the algorithm should look at

(It's a long-winded question, but please try to follow my thought process through the entire post before responding to any of the statements at the beginning.)

In a few of the training videos, Andrew discusses how to think about the way neural networks can perform image recognition. A few places that I’m referring to:

  • Module 1, “Example: Recognizing Images”, around the 2:30 point.
  • Module 1, “Inference: Making Predictions (Forward Propagation)”, around the 2:00 point.

In Andrew’s explanations, while talking about multi-layer networks with dense layers, he describes the way the algorithm is able to look at smaller parts of the image:

  • In the first layer (e.g., the facial recognition video), first find the small lines that will define features.
  • Then, in a second layer, take groups of those smaller lines.
  • Then, in a later layer, categorize the groups as features (this type of nose, this type of ear, etc.).

This makes a lot of sense to me as a strategy for breaking down an image recognition problem into smaller steps. As someone with a programming and engineering background, when I heard this explanation, I assumed we had to teach the model how to break the image down into parts in order to look at the features. For example, in the MNIST handwritten digit recognition example, I figured we would have to “do something” to instruct the model which parts of the image it should look at to distinguish a “6” from a “9”, etc.

However, I was surprised to see that we never had to do anything of the sort. I think this touches on a fundamental concept in machine learning, which is why I wanted to post this question and get either confirmation or clarification.

So, am I correct in the following summary? The reason neural networks are so powerful is that we don’t have to do anything to tell the model which parts of the handwritten digit images to look at. Rather, we create a TensorFlow neural network (we can try 25/15/10, we can try 50/24/12/2, etc.), train the model on a dataset such as MNIST, and the model “just figures it out” by adjusting the weights in different directions until they land in a place where the model accurately predicts the digit in the image more often than not?

If I’m understanding correctly, then it’s not clicking for me just how the model decides to look at certain parts of the image.

When we use the 25/15/10 dense-layer NN (3 layers, 400-pixel input → 25 dense → 15 dense → 10 dense) in the MNIST handwritten digit example… if “dense” means an all-to-all connection from every neuron in the first layer to every neuron in the second layer, how does the model learn to train each neuron in a layer on a different section of the image? In other words, how does the model know to look at pixels in the top left with the first neuron, and pixels in the bottom right with the 14th neuron (for example)?
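For reference, here is a minimal sketch of that 25/15/10 architecture in TensorFlow/Keras, assuming a flattened 20 x 20 = 400-pixel input. The activations and training settings here are illustrative, not necessarily the course lab’s exact code:

```python
import tensorflow as tf

# A minimal sketch of the 25/15/10 dense network described above.
# Assumes each image is already flattened to a 400-value vector (20 x 20 pixels).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(400,)),
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(10, activation="linear"),  # one output per digit class
])

# Training only adjusts the weights to reduce the loss; nothing here tells any
# neuron which region of the image to attend to.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
# model.fit(X_train, y_train, epochs=40)  # X_train: (m, 400), y_train: digit labels
```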

I think I understand the math behind it. IIUC, the pixels that each neuron in the first layer “looks at” are determined by its weights w. So if w has weights near “1” for some pixels and weights near “0” for the rest, the pixels with the higher weights will have a much larger influence on the output value.
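To make that concrete, here is a tiny sketch of the idea with made-up numbers (the weight pattern below is hypothetical, not something a trained network is guaranteed to produce): one neuron’s output is just a weighted sum of all 400 pixels, so pixels with near-zero weights barely affect it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(400)          # a flattened 20 x 20 image (values in [0, 1])

# Hypothetical weights for one first-layer neuron: large for the first 100
# pixels (the top 5 rows when flattened row by row), near zero for the rest.
w = np.concatenate([np.full(100, 1.0), np.full(300, 0.001)])
b = 0.0

z = np.dot(w, x) + b         # pre-activation of this neuron
a = max(0.0, z)              # ReLU activation

# Pixels with weights near 0 contribute almost nothing to z, so this neuron
# effectively "looks at" the top of the image -- but only because training
# happened to push its weights that way, not because we told it to.
print(z, a)
```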

I guess my question is:

If it’s all-to-all (in a dense layer), what makes the weights (in neuron 1) for the pixels in the top left of the image larger than the weights for the pixels in other parts of the image? Nowhere in our exercise did we instruct the neurons in the first dense layer which parts of the image to look at. And since it’s all-to-all, each neuron in the first layer has access to the entire pixel set, so how does it learn to divide and conquer parts of the image?

Yes.

Hello, @jtombs,

This is a great question!

Let me first clarify, with the screenshot below, that while those vertical blocks are how we represent a “dense” layer in the subsequent videos, Andrew was actually referring to a “convolutional” neural network when he discussed the example of detecting first very localized, small features and then building up to larger ones.

In fact, if you search for the word “dense” in the video to which the slide above belongs, you won’t find it.

The dense layer is covered in this specialization, and it is the most basic form of neural network layer, but the “convolutional” layer is not covered here. Convolutional layers are covered in another, more advanced course in the Deep Learning Specialization (DLS).

Since this resonates with you, I don’t think we should be content with my response so far. Let’s cross the line a bit and step into the DLS to see why convolutional layers can deliver that strategy.

I will try to be brief here.

A neuron in a convolutional layer is also called a “filter”, and a filter has a size. For example, I can have a convolutional layer with filters of size 3 pixels by 3 pixels. One way to explain it: this filter scans the whole image by moving, 1 pixel at a time, from left to right and from top to bottom. Google something like “convolutional layer gif animation” if you prefer a visual :wink:

At each step of the scan, it computes one output value. So, for a 5 x 5 image, a 3 x 3 filter will take 9 steps and produce 9 output values, and these 9 values form a 3 x 3 map (now we don’t call it an image) which feeds into the next convolutional layer.
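Here is a minimal NumPy sketch of that scan (my own illustration, not DLS code, with a random filter standing in for learned weights): a single 3 x 3 filter sliding over a 5 x 5 image, one pixel at a time, producing the 3 x 3 map of 9 values described above.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((5, 5))   # a toy 5 x 5 "image"
filt = rng.random((3, 3))    # one 3 x 3 filter (its 9 weights are learned in practice)

out = np.zeros((3, 3))       # (5 - 3 + 1) = 3 positions per direction -> 9 steps
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]     # the 3 x 3 part the filter "looks at"
        out[i, j] = np.sum(patch * filt)    # one output value per step

print(out.shape)             # (3, 3): the map fed to the next layer
```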

Back to “looking at smaller parts”, as you described it: this is how it looks at smaller parts, by walking through the image with a small filter. If the filter is 3 x 3, then each of the smaller parts is 3 x 3. You may google “horizontal line convolutional filter” for how such a filter detects a horizontal line.
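As a concrete, textbook-style example (not from the course), a kernel with positive weights on one row and negative weights on another gives a large-magnitude output wherever brightness changes between rows, i.e. at a horizontal edge. A trained filter could end up with similar weights if detecting horizontal lines helps the task:

```python
import numpy as np

# A classic horizontal-edge kernel (Prewitt-style).
horizontal = np.array([[ 1,  1,  1],
                       [ 0,  0,  0],
                       [-1, -1, -1]])

# Toy image: dark top rows, bright bottom rows -> a horizontal edge in the middle.
image = np.zeros((5, 5))
image[3:, :] = 1.0

response = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        response[i, j] = np.sum(image[i:i + 3, j:j + 3] * horizontal)

print(response)  # large-magnitude values in the rows that straddle the edge
```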

The output of a filter is a map (again, we don’t call it an image because it’s no longer anything visually recognizable) where each value represents information from not just one pixel of the original image but a group of pixels (3 x 3 = 9 pixels in our example). So if we apply another filter to this map, let’s say a 4 x 4 filter this time, then, essentially, this new filter is scanning, at each of its steps, information from 6 x 6 = 36 pixels of the original image.
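A back-of-the-envelope sketch of that arithmetic, assuming stride 1 and no pooling anywhere:

```python
# Each value in the first map already summarizes a 3 x 3 patch of the image,
# and a 4 x 4 filter on that map spans 4 adjacent such patches per direction.
first_filter = 3
second_filter = 4
stride = 1

receptive_field = first_filter + (second_filter - 1) * stride   # 3 + 3 = 6
print(receptive_field, "x", receptive_field, "=", receptive_field ** 2, "pixels")  # 6 x 6 = 36
```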

Therefore, the early convolutional layers tend to look at smaller parts, while the deeper ones look at bigger parts.

Should we consider the choice of convolutional layers to be the way we instruct the model how to break things down? That sounds like a reasonable statement to me. :wink:

I think my question above echoes, and perhaps answers, the one in the last paragraph of your post. What do you think?

Cheers,
Raymond