Hidden layer feature extraction

How nodes in a hidden layer extract useful feature?


In simple words, gradient descent has adjusted the network’s parameters to some values such that the network was able to transform the inputs into something predictive of the label.

Gradient descent achieved this by minimizing the cost, which is also the only thing that gradient descent is tasked to do. It wasn’t tasked to extract feature, but since it minimized the cost (in other words, made predictions very close to labels), we human could say the neural network was able to transform inputs into some good features. They were good because the final predictions were good.

Neural Network as “transforming X into X_t” is illustrated in below: