(It’s a long-winded question, but please try to follow my thought process through the entire post before responding to statements at the beginning.)
In a few of the training videos, Andrew discusses how to think about the way neural networks can perform image recognition. A few places that I’m referring to:
- module 1 “example: recognizing images” around the 2:30 point.
- module 1 “inference: making predictions (forward propagation)” around the 2:00 point.
In Andrew’s explanations of multi-layer networks with dense layers, he describes how the algorithm is able to look at smaller parts of the image:
- in the first layer (e.g., the facial recognition video), it finds short lines/edges that will define features;
- then, in a second layer, it takes groups of those smaller lines;
- then, in a later layer, it categorizes the groups as features (this type of nose, this type of ear, etc.).
This makes a lot of sense to me as a strategy for breaking an image recognition problem down into smaller steps. As someone with a programming and engineering background, when I heard this explanation I assumed we had to teach the model how to break the images into parts and look at the features. For example, with the MNIST handwritten digit recognition, I figured we would have to “do something” to instruct the model which parts of the image to look at in order to distinguish a “6” from a “9”, etc.
However, I was surprised to see that we never had to do any of that. I think this is such a fundamental concept in machine learning that I wanted to post this question to get either confirmation or clarification.
So, am I correct in the following summary? Part of the reason neural networks are so powerful is that we don’t have to tell the model which parts of the handwritten digit images to look at. Rather, we create a TensorFlow neural network (we can try 25/15/10, or 50/24/12/2, etc.), train it on a dataset such as MNIST, and the model “just figures it out” by adjusting the weights in different directions until they land in a place where the model accurately predicts the digit in the image more often than not?
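To be concrete about what I mean by “we don’t have to tell it anything,” here is roughly how I picture the setup (a minimal sketch in the spirit of the course lab; the placeholder data, layer names, and exact compile settings are my own assumptions, not the lab code):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

# Placeholder data just to show the shapes: m flattened 20x20 images, labels 0-9.
X = np.random.rand(1000, 400).astype("float32")
y = np.random.randint(0, 10, size=(1000,))

model = Sequential([
    Input(shape=(400,)),                            # every first-layer neuron sees all 400 pixels
    Dense(25, activation="relu", name="layer1"),
    Dense(15, activation="relu", name="layer2"),
    Dense(10, activation="linear", name="layer3"),  # logits for the 10 digits
])

model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(0.001),
)

# Nowhere do we say which pixels matter; gradient descent just nudges every weight.
model.fit(X, y, epochs=10)
```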
If I’m understanding that correctly, then what’s not clicking for me is just how the model decides to look at certain parts of the image.
When we use the 25/15/10 dense-layer NN (3 layers: 400-pixel input → 25 dense → 15 dense → 10 dense) in the MNIST handwritten digit example… if “dense” means an all-to-all connection from every input to every neuron in the next layer, how does the model learn to tune each neuron in a layer to a different section of the image? In other words, how does the model know to look at pixels in the top left with the first neuron and pixels in the bottom right with the 14th neuron (for example)?
I think I understand the math behind it. IIUC, the pixels that each neuron in the first layer “looks at” are determined by its weight vector w: if w has weights near 1 for some pixels and near 0 for the rest, the pixels with the higher weights will influence the output value much more.
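To make that concrete, here is the arithmetic I have in mind for one first-layer neuron (a toy sketch with made-up weight values, not anything taken from a trained model):

```python
import numpy as np

x = np.random.rand(400)        # one flattened 20x20 image, pixel values in [0, 1]

# Made-up weights for a single first-layer neuron: large for a handful of pixels
# (pretend these indices happen to be the top-left corner), zero everywhere else.
w = np.zeros(400)
w[:20] = 1.0
b = 0.0

z = np.dot(w, x) + b           # z = w . x + b
a = max(z, 0.0)                # ReLU activation g(z)

# The ~380 pixels with zero weight contribute nothing to a, so this neuron
# effectively "looks at" only the heavily weighted pixels.
print(a)
```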
I guess my question is:
If it’s all-to-all (in a dense layer), what makes the weights in neuron 1 for the pixels in the top left of the image larger than the weights for pixels in other parts of the image? Nowhere in our exercise did we instruct the neurons in the first dense layer which parts of the image to look at. And if it’s all-to-all, each neuron in the first layer has access to the entire pixel set, so how does it learn to divide and conquer parts of the image?
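For what it’s worth, this is how I’ve been trying to poke at it myself: looking at the learned first-layer weights after training (this assumes the model sketch above; whether reshaping to 20x20 lines up with the image depends on how the pixels were flattened):

```python
import numpy as np

# Kernel W has shape (400, 25): column j holds the 400 pixel weights of first-layer neuron j.
W, b = model.get_layer("layer1").get_weights()

neuron_0 = W[:, 0].reshape(20, 20)   # neuron 0's weights, laid out like the image
print(np.round(neuron_0, 2))         # larger magnitudes show which pixels it emphasizes
```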