Originally written by: Edward Shyu
This is optional material that you can read after the week 2 video “Gradient descent on m examples.” You don’t need to know calculus in order to complete this course (or the other courses in the specialization), so this derivation is optional. This is for those who are curious about where the “dz = a - y” comes from.
This can be more fun and easier to digest if you follow along with a pencil and paper!
Derivation of \frac{dL}{dz}
If you’re curious, here is the derivation for \frac{dL}{dz} = a - y
Note that in this part of the course, Andrew refers to \frac{dL}{dz} as dz.
By the chain rule: \frac{dL}{dz} = \frac{dL}{da} \times \frac{da}{dz}
We’ll do the following: 1. solve for \frac{dL}{da}, then 2. solve for \frac{da}{dz}, and finally 3. multiply the two results to get \frac{dL}{dz}.
Step 1: \frac{dL}{da}
L = -(y \times \log(a) + (1-y) \times \log(1-a))
\frac{dL}{da} = -y \times \frac{1}{a} - (1-y) \times \frac{1}{1-a} \times (-1)
Here we are taking the derivative with respect to a.
The extra -1 in the last term comes from the chain rule: the derivative of (1-a) with respect to a is -1. Also note that notational conventions in the ML world differ from those in the math world: here log always means the natural log, which is why the derivative of log(a) is \frac{1}{a}.
\frac{dL}{da} = \frac{-y}{a} + \frac{1-y}{1-a}
We’ll give both terms the same denominator:
\frac{dL}{da} = \frac{-y \times (1-a)}{a\times(1-a)} + \frac{a \times (1-y)}{a\times(1-a)}
Clean up the terms:
\frac{dL}{da} = \frac{-y + ay + a - ay}{a(1-a)}
The +ay and -ay terms cancel, so now we have:
\frac{dL}{da} = \frac{a - y}{a(1-a)}
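If you'd like to double-check Step 1 without redoing the algebra, here is a minimal sketch that compares the formula above against a numerical (finite-difference) estimate of the slope. The function names and the step size eps are illustrative choices, not something from the course.

```python
import math

def loss(a, y):
    """Logistic loss for a single example (log is the natural log)."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def dL_da(a, y):
    """Closed-form result from Step 1: (a - y) / (a (1 - a))."""
    return (a - y) / (a * (1 - a))

def numeric_dL_da(a, y, eps=1e-6):
    """Central finite-difference estimate of dL/da (eps is an assumed step size)."""
    return (loss(a + eps, y) - loss(a - eps, y)) / (2 * eps)

# Compare the two at a few hand-picked (a, y) pairs with 0 < a < 1.
for a, y in [(0.3, 1), (0.8, 0), (0.5, 1)]:
    print(a, y, dL_da(a, y), numeric_dL_da(a, y))
```

The last two numbers on each printed line should agree to several decimal places.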
Step 2: \frac{da}{dz}
\frac{da}{dz} = \frac{d}{dz} \sigma(z)
The derivative of a sigmoid has the form:
\frac{d}{dz}\sigma(z) = \sigma(z) \times (1 - \sigma(z))
You can look up why the derivative has this form. For example, search for “derivative of a sigmoid” to see the derivation in detail.
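For the curious, here is a short sketch of that derivation. It assumes the standard definition of the sigmoid, \sigma(z) = \frac{1}{1 + e^{-z}}, which is not restated in this note.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1}

% Chain rule: the derivative of (1 + e^{-z}) with respect to z is -e^{-z}
\frac{d}{dz}\sigma(z) = -(1 + e^{-z})^{-2} \times (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}

% Split into two factors
\frac{d}{dz}\sigma(z) = \frac{1}{1 + e^{-z}} \times \frac{e^{-z}}{1 + e^{-z}}

% The second factor equals 1 - \sigma(z)
1 - \sigma(z) = \frac{1 + e^{-z} - 1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}}

\frac{d}{dz}\sigma(z) = \sigma(z) \times (1 - \sigma(z))
```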
Recall that \sigma(z) = a, because we defined “a”, the activation, as the output of the sigmoid activation function.
So we can substitute into the formula to get:
\frac{da}{dz} = a (1 - a)
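As a quick sanity check on Step 2, the sketch below compares a (1 - a) with a finite-difference estimate of the sigmoid's slope. It assumes the standard sigmoid definition 1 / (1 + e^{-z}); the function names and the step size eps are illustrative.

```python
import math

def sigmoid(z):
    """Standard sigmoid, assumed to be 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def da_dz(z):
    """Closed-form result from Step 2: a (1 - a), where a = sigmoid(z)."""
    a = sigmoid(z)
    return a * (1 - a)

def numeric_da_dz(z, eps=1e-6):
    """Central finite-difference estimate of da/dz (eps is an assumed step size)."""
    return (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# Compare the two at a few hand-picked values of z.
for z in [-2.0, 0.0, 1.5]:
    print(z, da_dz(z), numeric_da_dz(z))
```

At z = 0 the sigmoid outputs 0.5, so the closed form gives 0.5 \times 0.5 = 0.25, the steepest slope the sigmoid reaches.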
Step 3: \frac{dL}{dz}
We’ll multiply step 1 and step 2 to get the result.
\frac{dL}{dz} = \frac{dL}{da} \times \frac{da}{dz}
From step 1: \frac{dL}{da} = \frac{a - y}{a(1-a)}
From step 2: \frac{da}{dz} = a (1 - a)
\frac{dL}{dz} = \frac{a - y}{a(1-a)} \times a (1 - a)
Notice that the factor a(1-a) in the denominator cancels with the a(1-a) from step 2, leaving:
\frac{dL}{dz} = a - y
In Andrew’s notation, \frac{dL}{dz} is written as dz.
So in the videos:
dz = a - y
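To see the final result in action, here is a minimal sketch that treats the loss as a function of z directly and checks that its numerical slope matches a - y. It assumes the standard sigmoid definition; the function names and the step size eps are illustrative choices, not something from the course.

```python
import math

def sigmoid(z):
    """Standard sigmoid, assumed to be 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_z(z, y):
    """Logistic loss written as a function of z, with a = sigmoid(z)."""
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def dz(z, y):
    """The result derived above: dL/dz = a - y."""
    return sigmoid(z) - y

def numeric_dz(z, y, eps=1e-6):
    """Central finite-difference estimate of dL/dz (eps is an assumed step size)."""
    return (loss_from_z(z + eps, y) - loss_from_z(z - eps, y)) / (2 * eps)

# Compare the two at a few hand-picked (z, y) pairs.
for z, y in [(0.7, 1), (-1.2, 0), (2.0, 0)]:
    print(z, y, dz(z, y), numeric_dz(z, y))
```

The two values on each line should agree closely, which is a handy way to catch sign mistakes when you implement gradient descent yourself.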