Throughout the course we’ve seen references that neurons “automatically” specialize in ways somewhat analogous to the “feature engineering” we saw in Course 1. For example, we’ve seen that a neural network for face recognition tends to have the nodes in its first hidden layer recognize clumps of pixels as lines or edges, the nodes in the second hidden layer build on that output to recognize features like eyes, noses and mouths, and a third hidden layer put those together to assemble faces, and so on. I’m trying to build my intuition as to why this is the trend that NNs follow.
I’ve seen an excellent comment by @rmwkwok which explains that this all stems from the fact that the weights are initialized differently at the start, and thus, since their derivatives will also differ, each weight follows its own learning path. But how come this results in each node seemingly specializing in detecting a specific pattern in the data?
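To check my understanding of that symmetry-breaking argument, here’s a tiny toy sketch I put together (my own numpy example, not course code): if two hidden units start with identical weights they receive identical gradients and can never diverge, while random initialization makes them differ from the very first update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 examples, 3 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # toy target

def hidden_grads(W1, w2):
    """One hidden layer (2 sigmoid units) + linear output; squared-error loss.
    Returns the gradient with respect to the hidden-layer weights W1."""
    A1 = 1 / (1 + np.exp(-(X @ W1)))          # hidden activations, shape (100, 2)
    err = A1 @ w2 - y                          # output error
    dZ1 = np.outer(err, w2) * A1 * (1 - A1)    # backprop through the sigmoids
    return X.T @ dZ1 / len(X)                  # shape (3, 2): one column per unit

# Case 1: both hidden units start with the SAME weights -> identical gradients,
# so the two units remain clones of each other forever.
dW1 = hidden_grads(np.full((3, 2), 0.5), np.array([0.5, 0.5]))
print(np.allclose(dW1[:, 0], dW1[:, 1]))       # True

# Case 2: random initialization -> the gradients differ immediately,
# so each unit starts following its own learning path.
dW1 = hidden_grads(rng.normal(scale=0.5, size=(3, 2)), np.array([0.5, 0.5]))
print(np.allclose(dW1[:, 0], dW1[:, 1]))       # False
```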
Another example comes from this week’s Multiclass lab, which classifies the data into 4 categories using 2 nodes in the hidden layer. When we plot the weights after training the model, we see that the first node created a decision boundary that separates categories 0 and 1 from categories 2 and 3, and the second node’s decision boundary separates categories 0 and 2 from categories 1 and 3. In the “model decision boundary” graph we can see how these combine into a neat 4-way classification that fits our categories.
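For reference, this is roughly the kind of model I mean (a minimal sketch from memory, not the exact lab code, so details like activations and optimizer settings may differ):

```python
import tensorflow as tf

# 2 hidden units -> the two decision boundaries described above;
# 4 output units -> one logit per category.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation="relu", name="hidden"),
    tf.keras.layers.Dense(4, activation="linear", name="output"),
])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(0.01),
)
# model.fit(X_train, y_train, epochs=200)
# The weights of the "hidden" layer are what get plotted as the two boundaries:
# model.get_layer("hidden").get_weights()
```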
How come we get these trends from random weights? How come the hidden layers don’t just look random to us and somehow combine to correctly produce the output? How come each node seems to find its own distinct niche, instead of two or more nodes converging on detecting the same thing?
That’s not really how NN layers actually work. It’s just an intuitive way to explain the concept of NNs and layers that students can readily understand.
All you can say with confidence is that the weight values are adjusted so that the cost on the training set is minimized. There are no guarantees about what any specific neuron or layer is going to learn.
In the example you mention, the NN learns to identify each individual label because we constructed and labeled the training set so that this is exactly what would happen. Each output unit learns to identify one individual label, and ignore the others. It’s the “one-vs-all” method, where you have N outputs for N labels. It’s implemented by one-hot coding the labels so each one is a one-hot vector.
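Just to illustrate what one-hot coding the labels looks like (a quick example, not taken from the lab code):

```python
import numpy as np

labels = np.array([0, 3, 1, 2])      # 4 possible categories
one_hot = np.eye(4)[labels]          # each label becomes a one-hot row vector
print(one_hot)
# [[1. 0. 0. 0.]
#  [0. 0. 0. 1.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]]
```

Each output unit is then trained against one column of this matrix, which is why it ends up responding to one label and ignoring the others.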
@TMosh thanks for the response! I guess I was reading too much into the wording used in the materials, but “All you can say with confidence is that the weight values are adjusted so that the cost on the training set is minimized” makes it very clear.
@TMosh in Week 3 there’s another video that made me go back to this question, the one on transfer learning. In it the example of image recognition is brought back, and it’s mentioned that nodes in a NN trained to recognize objects may have learned to detect edges in one layer, then corners in another, then basic shapes in a further layer. I understand this language is meant to build our intuition about the inner workings of NNs, but still, it’s given as the justification for why initializing a NN for another type of image recognition, for example classification of handwritten digits, with the weights of the first NN works. So this implies that there is a degree of domain specificity that emerges from the learning process in the hidden layers and nodes. Would that be a fair assertion?
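To make my question concrete, this is roughly what I picture transfer learning as meaning in code (layer names, sizes and the weights file are made up for illustration, not taken from the video):

```python
import tensorflow as tf

# Early layers from a network trained on general object recognition...
base = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", name="edges"),
    tf.keras.layers.Dense(64, activation="relu", name="corners"),
    tf.keras.layers.Dense(32, activation="relu", name="shapes"),
])
# base.load_weights("pretrained_object_recognition.weights.h5")  # hypothetical file

for layer in base.layers:
    layer.trainable = False            # reuse the learned "low-level" features as-is

# ...reused as the starting point for a new task (handwritten digits).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(10, activation="linear", name="digit_logits"),
])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer="adam",
)
# model.fit(X_digits, y_digits, epochs=...)
```

If reusing those early layers helps on the new task, that seems to imply they learned something generic like edges and shapes rather than something arbitrary, which is what my question is getting at.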