Hello @DagerD,

Looking at this logistic loss function, note that it has only two slots: y for the labels and p for L2's output. Do you still think there is room for L1's output in it?
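To make the "only two slots" point concrete, here is a minimal sketch of the logistic loss (my own illustration, not the course's code):

```python
import math

def logistic_loss(y, p):
    """Binary cross-entropy (logistic loss).

    y: the true label (0 or 1)
    p: the predicted probability, i.e. L2's output

    Note there is no slot for any intermediate value such as
    L1's output; the loss only ever sees y and p.
    """
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Small loss when p agrees with y, large loss when it disagrees:
print(logistic_loss(1, 0.9))
print(logistic_loss(1, 0.1))
```

Whatever L1 computes only influences the loss indirectly, through the value of p that L2 eventually produces.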

Just because we used L1’s sigmoid output in gradient descent does not make it a probability.

Think about the following:

Our input features are temperature and duration. If we applied a sigmoid to the input layer, those values would participate in gradient descent just as much as L1's output does. Would that make Sigmoid(temperature) a probability?

I hope your answer is no.

Sigmoid + involvement in gradient descent is not a probability-maker.

Being between 0 and 1 is necessary for a value to be a probability, but not sufficient. Involvement in gradient descent has nothing to do with being a probability at all.

The reason I asked this is that it is a good starting point for thinking. Below I will show how it connects to the answer:

Step 1 is just math.

Steps 2, 3, 4, 5 and 6 follow from definitions. Read Wikipedia or search for the binomial distribution if you are not familiar with it.

Steps 7 and 8 are what we supply.

So, by using the logistic loss function and supplying the label as y and L2's output as p, we are actually producing a neural network that models p.

p is a probability, and so L2's output *is modelled to be* a probability.
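For anyone following along, the standard maximum-likelihood sketch of this connection (my paraphrase, not necessarily the exact steps referenced above) is:

```latex
% Bernoulli likelihood (the n = 1 case of the binomial) of label y given p:
P(y \mid p) = p^{y} (1 - p)^{1 - y}

% Taking the negative log recovers the logistic loss:
-\log P(y \mid p) = -\big( y \log p + (1 - y) \log (1 - p) \big)
```

Minimizing the logistic loss over the training set is therefore the same as maximizing the likelihood under a model where p plays the role of P(y = 1), which is exactly why p is modelled to be a probability.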

It is L2’s output we give as p, not L1’s output, and not Sigmoid(temperature).

Cheers,

Raymond

PS: tagging @alex_fkh too.