Hi!
There are some differences in notation between the lectures and the labs in C2. The labs were designed to match the notation of other courses from the DLAI series, like the Deep Learning Specialization (DLS). However, the lectures use a simpler form of notation.
In the lectures, Luis uses \hat{y} for the predicted value. This predicted value is what comes out of the neural network, described by \sigma(w_1 x_1 + w_2 x_2 + b). The labs introduce a difference here: they define the weight matrix as W = [ w_1 \quad w_2 ]. This is done so that the same notation extends to cases with more neurons and inputs, where W becomes a matrix of size m \times n.
Both the lecture and the lab start the same way by defining z = w_1 x_1 + w_2 x_2 + b, but then they go in different directions:
- Lecture says: \hat{y} = \sigma(z)
- Lab says: a = \sigma(z)
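To make the point concrete, here is a minimal sketch showing that the two notations name the same number. The weights, bias and inputs are hypothetical values picked only for illustration, and it uses plain Python rather than the lab's own code.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters and inputs, chosen only for illustration.
w1, w2, b = 0.5, -0.25, 0.1
x1, x2 = 2.0, 4.0

z = w1 * x1 + w2 * x2 + b   # shared first step in lecture and lab
y_hat = sigmoid(z)          # lecture notation
a = sigmoid(z)              # lab notation

# Same computation, two names.
print(z, y_hat, a, y_hat == a)
```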
The difference exists because the lab wants to be clear that \sigma(z) is a function that turns any z value into a real number between 0 and 1 using the sigmoid formula. While it’s not wrong to use \hat{y} for this, it might be confusing because \hat{y} is supposed to be the predicted value for a specific point in the dataset, which we compare with the actual label, y.
In the lecture, the derivative of the loss function with respect to \hat{y} is shown as:
\frac{\partial L}{\partial \hat{y}} = -\frac{y - \hat{y}}{\hat{y}(1 - \hat{y})}
But in the lab, with its notation, it looks like:
\frac{\partial L}{\partial a} = -\frac{y - a}{a(1 - a)}
The lab also adds an index to y and a to make it clear that this calculation is for each point in the dataset.
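As a rough sanity check of this formula, evaluated once per data point, the sketch below compares it against a central finite difference of the log-loss. The labels and activations are made-up values, and the function names are my own, not the lab's.

```python
import math

def log_loss(y, a):
    """Log-loss for a single example with label y and activation a."""
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

def dL_da(y, a):
    """Derivative of the log-loss with respect to a, in the lab's notation."""
    return -(y - a) / (a * (1 - a))

# Hypothetical labels y^(i) and activations a^(i), one pair per data point.
ys = [1, 0, 1]
activations = [0.9, 0.2, 0.6]

eps = 1e-6
for i, (y, a) in enumerate(zip(ys, activations)):
    # Central finite difference as an independent estimate of the derivative.
    numeric = (log_loss(y, a + eps) - log_loss(y, a - eps)) / (2 * eps)
    print(i, dL_da(y, a), numeric)
```

The two printed columns should agree to several decimal places for every index i.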
Moreover, the lab introduces a step to compute \frac{\partial L}{\partial z}, which isn’t explicitly done in the lecture. Luis focuses on the simpler case of directly computing \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, and \frac{\partial L}{\partial b}. The lab's intermediate step is the more general approach, and it is key when dealing with more complex neural networks.
Let’s look at computing \frac{\partial L}{\partial z}:
Since L is the log-loss function and depends on z (which in turn depends on w_1, w_2, and b), the lab notation shows it as:
L(z) = L(W, b) = -y^{(i)}\log(a^{(i)}) - (1-y^{(i)})\log(1- a^{(i)})
Since L depends on z only through a, and a comes from z, we can write L = L(a). This makes it easier to apply the chain rule:
\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z}
Since a is the sigmoid function applied to z, its derivative with respect to z is:
\frac{\partial a}{\partial z} = a(z) (1 - a(z)) = a(1-a)
Note that the above formula is strictly an ordinary (one-dimensional) derivative rather than a partial derivative, since here a depends only on the single variable z. The partial notation is kept because, in this context, z is itself a function of the parameters.
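If you want to convince yourself of the sigmoid derivative, a small numerical check along these lines may help. It is only a sketch: the z values are arbitrary and `sigmoid_prime` is a name I made up for this example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid written in terms of a = sigmoid(z)."""
    a = sigmoid(z)
    return a * (1 - a)

eps = 1e-6
for z in (-2.0, 0.0, 1.5):  # arbitrary test points
    # Central finite difference of sigmoid around z.
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(z, sigmoid_prime(z), numeric)
```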
Putting it all together:
\frac{\partial L}{\partial z} = -\frac{y - a}{a(1 - a)} \cdot a(1-a) = -(y -a) = a - y
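The cancellation above can also be checked numerically. The sketch below multiplies the two chain-rule factors, compares the product with a - y, and compares both with a finite-difference estimate of the loss as a function of z. The label and z value are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_z(y, z):
    """Log-loss written as a function of z, via a = sigmoid(z)."""
    a = sigmoid(z)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

# Hypothetical label and pre-activation, for illustration only.
y, z = 1, 0.3
a = sigmoid(z)

dL_da = -(y - a) / (a * (1 - a))   # first chain-rule factor
da_dz = a * (1 - a)                # second chain-rule factor
chain = dL_da * da_dz              # product of the two factors
closed_form = a - y                # simplified result

eps = 1e-6
numeric = (loss_from_z(y, z + eps) - loss_from_z(y, z - eps)) / (2 * eps)

print(chain, closed_form, numeric)
```

All three printed values should match up to floating-point and finite-difference error.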
While this step might seem unnecessary for simpler models with just three parameters, it’s crucial for neural networks with many parameters and layers.
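To show why the intermediate step is useful, here is a sketch of how \frac{\partial L}{\partial z} can be reused to get the three parameter gradients. It assumes the single-neuron model z = w_1 x_1 + w_2 x_2 + b from above, so that \frac{\partial z}{\partial w_1} = x_1, \frac{\partial z}{\partial w_2} = x_2 and \frac{\partial z}{\partial b} = 1. All numeric values and the `grads` dictionary are hypothetical, not taken from the lab.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    """Log-loss of a single example for the one-neuron model."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

# Hypothetical parameters and one hypothetical data point.
w1, w2, b = 0.5, -0.25, 0.1
x1, x2, y = 2.0, 4.0, 1

a = sigmoid(w1 * x1 + w2 * x2 + b)
dL_dz = a - y  # computed once, reused for every parameter

grads = {
    "w1": dL_dz * x1,  # dz/dw1 = x1
    "w2": dL_dz * x2,  # dz/dw2 = x2
    "b": dL_dz,        # dz/db = 1
}

# Finite-difference check on w1 only, as an independent estimate.
eps = 1e-6
numeric_w1 = (loss(w1 + eps, w2, b, x1, x2, y)
              - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)

print(grads, numeric_w1)
```

With more layers, the same idea applies: the derivative with respect to z is computed once and then passed to everything z depends on.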
I hope this makes things clearer. If you have more questions, just let me know.
I will ask the curriculum engineer who made the lab to mention this divergence, so future learners will not struggle with it.
Thanks,
Lucas