I did not think about cost functions and how derivatives for each unit are computed until I watched the "Larger neural network" video from the "Back Propagation (Optional)" videos. Until then I was thinking that every unit would have its own cost function and gradient descent, especially because each layer can have a unique activation function.

For example, let's consider a NN model with (Layer 1: Sigmoid), (Layer 2: ReLU), and (Output layer: Linear). If we build this model in TensorFlow, we only specify a single cost function, and it is based on the final output layer; in this case, since the output is linear, the cost function can be MSE. I had assumed that every unit, or every layer (since they have unique activation functions), would have its own cost function and gradient descent algorithm. Now that I think about it, it does not work like that, but I never gave it serious thought.
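To make that concrete, here is a minimal NumPy sketch of that architecture's forward pass (the layer sizes and data are made up for illustration). Note there is exactly one scalar cost at the very end, computed from the output layer, rather than one cost per layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))   # 5 examples, 3 features (made up)
y = rng.normal(size=(5, 1))   # made-up regression targets

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # Layer 1: sigmoid
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)   # Layer 2: ReLU
W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)   # Output layer: linear

a1 = sigmoid(x @ W1 + b1)
a2 = relu(a1 @ W2 + b2)
y_hat = a2 @ W3 + b3          # linear output, no activation applied

# One cost function for the whole network: MSE on the final output only
cost = np.mean((y_hat - y) ** 2)
```

The intermediate activations `a1` and `a2` never get their own loss; they only matter through their effect on `y_hat`, which is exactly why a single cost at the end is enough.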

So basically, without backprop we would have to substitute the Layer 1 function into the Layer 2 function, then substitute that into the output layer's activation function, then take the derivative of the cost with respect to all the parameters, and only then could we finally do gradient descent for every input value.
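That substitution idea is just the chain rule. Here is a tiny one-unit-per-layer sketch (all weights and values are made up) where the three activations are composed into one function, the derivative with respect to the innermost weight is worked out by the chain rule, and the result is checked against a numerical derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One unit per layer: cost of linear(w3 * relu(w2 * sigmoid(w1 * x)))
w1, w2, w3 = 0.5, 1.2, 2.0   # made-up weights
x, y = 1.5, 0.7              # made-up input and target

def cost(w1):
    a1 = sigmoid(w1 * x)          # Layer 1: sigmoid
    a2 = max(0.0, w2 * a1)        # Layer 2: ReLU
    y_hat = w3 * a2               # Output: linear
    return (y_hat - y) ** 2       # single MSE-style cost

# Chain rule by hand: dJ/dw1 = dJ/dy_hat * dy_hat/da2 * da2/da1 * da1/dw1
a1 = sigmoid(w1 * x)
a2 = max(0.0, w2 * a1)
y_hat = w3 * a2
dJ_dyhat = 2 * (y_hat - y)
dyhat_da2 = w3
da2_da1 = w2 if w2 * a1 > 0 else 0.0   # ReLU passes or blocks the gradient
da1_dw1 = a1 * (1 - a1) * x            # sigmoid derivative times inner slope
grad_chain = dJ_dyhat * dyhat_da2 * da2_da1 * da1_dw1

# Sanity check with a centered finite difference
eps = 1e-6
grad_numeric = (cost(w1 + eps) - cost(w1 - eps)) / (2 * eps)
```

The two gradients agree to numerical precision, which is the sense in which backprop is "just" an efficient, organized way of applying this substitution-and-differentiate process layer by layer.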

I don't know if I am slow to understand this or what, but I feel this is important for getting a sense of how neural networks work, even if we do not need it just to use them, and that is why I think this section should not be optional.