Is it correct to say that each hidden layer in a neural network basically represents a new logistic regression unit? And does something like an overall cost function for the neural network exist? If so, how would we write it?
Well, note that there are other things about the hidden layers that are different from a Logistic Regression layer:
- The activation function is a choice: it doesn’t have to be sigmoid. It could be tanh, ReLU, Leaky ReLU, Swish or any of several other choices.
- A Logistic Regression layer has just one output: a single neuron scaled by sigmoid that you then interpret as a “yes/no” answer. An internal hidden layer of a network has many output neurons, each of which feeds into every neuron in the next layer (see the sketch just after this list).
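To make that concrete, here is a minimal NumPy sketch contrasting a Logistic-Regression-style output unit with a hidden layer of several units. The shapes, variable names, and the choice of ReLU are just illustrative assumptions, not anything tied to a specific assignment:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Toy shapes (illustrative only): 3 input features, 5 samples, 4 hidden units.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))    # inputs, shape (n_x, m)

# Logistic-Regression-style layer: ONE output neuron, sigmoid activation.
w = rng.standard_normal((1, 3))
b = np.zeros((1, 1))
y_hat = sigmoid(w @ X + b)         # shape (1, m): one "yes/no" score per sample

# Hidden layer: MANY output neurons, and the activation is a design choice (ReLU here).
W1 = rng.standard_normal((4, 3))
b1 = np.zeros((4, 1))
A1 = relu(W1 @ X + b1)             # shape (4, m): each of these 4 outputs becomes
                                   # an input to every neuron in the next layer
```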
In terms of the cost, remember that it is computed only at the output layer. That means that for a binary (“yes/no”) classification it looks exactly the same as in the Logistic Regression case: we apply the cross entropy loss function (“log loss”) to the output of sigmoid and then take the average over all the samples to get the scalar cost value. But the gradients of that cost are taken with respect to all the weights and bias values at every layer, so we have to “back propagate” through the hidden layers using the Chain Rule to compute them. So it looks the same at the output layer, but it gets more complicated as we propagate back through the hidden layers.
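Since you asked how we would actually write it: for binary classification with a sigmoid output, it is the same cross entropy cost as in Logistic Regression,

$$J = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\hat{y}^{(i)} + \big(1 - y^{(i)}\big)\log\big(1 - \hat{y}^{(i)}\big)\Big]$$

and here is a minimal NumPy sketch of computing that cost plus the very first step of back propagation at the output layer (all variable names and shapes here are my own illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy example: A_prev is the output of the last hidden layer, Y holds the 0/1 labels.
rng = np.random.default_rng(1)
m = 5                                  # number of samples
A_prev = rng.standard_normal((4, m))   # activations from the hidden layer
Y = rng.integers(0, 2, size=(1, m))    # true labels

W_out = rng.standard_normal((1, 4))
b_out = np.zeros((1, 1))

Z = W_out @ A_prev + b_out
A = sigmoid(Z)                         # predicted probabilities, shape (1, m)

# Cross entropy ("log loss") averaged over the m samples -> one scalar cost,
# exactly as in Logistic Regression.
cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

# First back propagation step: for sigmoid + cross entropy the output-layer
# gradient simplifies to dZ = A - Y.
dZ = A - Y
dW_out = (dZ @ A_prev.T) / m
db_out = np.mean(dZ, axis=1, keepdims=True)
```

From `dZ` onward, the Chain Rule carries the gradients back through each hidden layer in turn, which is where the extra complexity compared to plain Logistic Regression comes in.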