Hello again @Jaskeerat. Derivation may be too strong a word choice. The choice of the loss function is closely related to the choice of output units (activations in the output layer, e.g., sigmoid, softmax). In this sense, they are not chosen arbitrarily.
One must keep in mind here what the output layer is trying to accomplish. The hidden features that feed in as inputs to the output layer have no discernable meaning. So the output layer provides additional information to complete the task at hand. Here, it is a classification task so that probabilities to help decide which category an example belongs to, are useful.
And, importantly, since one can normalize any list of positive values by dividing by their sum – so that all the normalized values are in [0,1] and sum to one – the activation functions of the output layer are certaintly not chosen on that basis alone!
So we are looking for something special about sigmoid and softmax, right? OK, buckle up. Assuming one has a strong foundation in probability and statistics, it is understood that the loss function (cost function is the average of the losses) is typically chosen to be the cross-entropy between the data and model probability distributions. And that, is equivalent to the negative of the log-likelihood function of the model distribution. In the case of binary and multi-class classification, the model distributions are naturally chosen to be the Bernoulli and multinomial distributions.
The above emphasis turns out to be critical. Because log-probabilities are common in the statistical modeling literature. This was exactly my point in having you construct the sigmoid function from the logit model, because it shows you how the sigmoid unit can be motivated using the assumption that log probabilities are linear in z.
Confession: I played a small trick on you by taking the log of the inverse odds-ratio. This automatically normalized the probabilities (i.e. they would sum to one) in advance. You could try it again using log(y) on the right-side and then apply the normalization after the fact.
But what you did not know, was that log probabilities are special. You also didn’t know that cross-entropy is closely related to maximum likelihood and maximum likelihood is a miracle! Trust me. If you are still confused after thinking about this, good! I have spent most of my life that way and I can report that its (mostly) harmless. In fact, its the only thing that guarantees those “Ah-ha!” moments.
The internet is an ocean of information, so I encourage you to to practice your Google-Fu as you move on. On that note, I hope that you are enjoying Course 2!
Onwards and upwards!