Week 3 lab: Which video discusses this?

The following is from W3 Lab 2, where, alongside the partial derivative formulas, there is a specific callout to a video in which this “dL/da * da/dz = (a^(i) - y^(i))” result is discussed. I can’t recall where the breakdown was presented or how we arrived at this formula. Would you please help by directing me to the lesson link (or name)?

Thank you!

Week 2 is all background on partial derivatives and gradients.

Week 3 starts with three videos on “regression with a perceptron”. I believe that is where these equations are discussed.


I’ve reviewed W3’s videos and I’m still struggling to figure out how we went from
[image]
to this:
[image]

Would someone help me work out how we calculated dL/da * da/dz?

There’s a lot of calculus involved.
I’m not very good at calculus.
Perhaps someone else will reply.


Hi!

There is an issue in C2 with some differences in notation between the lectures and labs. The labs were designed to match the notation of other courses from the DLAI series, like the Deep Learning Specialization (DLS). However, the lectures use a simpler form of notation.

In the lectures, Luis uses the notation \hat{y} to mean the predicted value. This predicted value is what comes out of the neural network, described by \sigma(w_1 x_1 + w_2 x_2 + b). The labs introduce a difference here: they define the weight matrix W = [ w_1 \quad w_2 ]. This form is used to handle cases with more neurons and inputs, where W becomes a matrix of size m \times n.

Both the lecture and the lab start the same way by defining z = w_1 x_1 + w_2 x_2 + b, but then they go in different directions:

  • Lecture says: \hat{y} = \sigma(z)
  • Lab says: a = \sigma(z)

The reason for the difference is that the lab wants to make clear that \sigma(z) is a function that maps any value of z to a real number between 0 and 1 via the sigmoid formula. While it isn’t wrong to use \hat{y} here, it can be confusing, because \hat{y} is meant to be the predicted value for a specific point in the dataset, which we compare with the actual label y.
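If it helps to see the two notations side by side, here is a minimal sketch of the forward pass in NumPy (the numbers and variable names are mine, not the lab’s); the value stored in a is exactly what the lecture would call \hat{y}:

```python
import numpy as np

# Minimal sketch of the forward pass described above; variable names and
# values are illustrative, not taken from the lab. Two inputs, one sigmoid neuron.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w1, w2, b = 0.5, -0.3, 0.1    # parameters
x1, x2 = 2.0, 1.0             # a single data point

z = w1 * x1 + w2 * x2 + b     # z = w1*x1 + w2*x2 + b
a = sigmoid(z)                # lab notation: a = sigma(z); lecture: y_hat = sigma(z)
print(z, a)                   # z = 0.8, a ≈ 0.69
```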

In the lecture, the derivative of the loss function with respect to \hat{y} is shown as:

\frac{\partial L}{\partial \hat{y}} = -\frac{y - \hat{y}}{\hat{y}(1 - \hat{y})}

But in the lab, with its notation, it looks like:

\frac{\partial L}{\partial a} = -\frac{y - a}{a(1 - a)}

The lab also adds an index to y and a to make it clear that this calculation is for each point in the dataset.
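If you want to convince yourself of this formula without redoing the calculus, here is a small numerical check I put together (my own sketch, not part of the lab): the analytic expression should match a finite-difference estimate of the log-loss for a single point.

```python
import numpy as np

# Sanity check: compare the analytic formula dL/da = -(y - a) / (a * (1 - a))
# against a finite difference of the log-loss for one data point
# (the superscript (i) is dropped here since we look at a single point).
def log_loss(a, y):
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

a, y, eps = 0.7, 1.0, 1e-6
analytic = -(y - a) / (a * (1 - a))
numeric = (log_loss(a + eps, y) - log_loss(a - eps, y)) / (2 * eps)
print(analytic, numeric)      # both ≈ -1.4286
```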

Moreover, the lab introduces a step to compute \frac{\partial L}{\partial z}, which isn’t done explicitly in the lecture. Luis focuses on the simpler route of directly computing \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, and \frac{\partial L}{\partial b}. The lab’s more general approach is key when dealing with more complex neural networks.

Let’s look at computing \frac{\partial L}{\partial z}:

Since L is the log-loss function and depends on z (which in turn depends on w_1, w_2, and b), the lab notation shows it as:

L(z) = L(W, b) = -y^{(i)}\log(a^{(i)}) - (1-y^{(i)})\log(1- a^{(i)})

Knowing that L also depends on a, and a comes from z, we can simplify this to L = L(a). This makes it easier to apply the chain rule:

\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z}

Since a is the sigmoid function applied to z, its derivative with respect to z is:

\frac{\partial a}{\partial z} = a(z) (1 - a(z)) = a(1-a)

Note that the formula above is, strictly speaking, an ordinary (one-dimensional) derivative rather than a partial derivative, since we are treating z as a plain variable. The partial notation is kept because, in this context, z is itself a function of the parameters.
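The same kind of finite-difference check works for this derivative too (again just an illustrative sketch):

```python
import numpy as np

# Check that da/dz = a * (1 - a) for the sigmoid, via a finite difference.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.4, 1e-6
a = sigmoid(z)
analytic = a * (1 - a)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(analytic, numeric)      # both ≈ 0.2403
```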

Putting it all together:

\frac{\partial L}{\partial z} = -\frac{y - a}{a(1 - a)} \cdot a(1-a) = -(y -a) = a - y

While this step might seem unnecessary for simpler models with just three parameters, it’s crucial for neural networks with many parameters and layers.
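To make the cancellation concrete, here is a short sketch (my own numbers again, not the lab’s) that evaluates the chain-rule product, confirms it equals a - y, and then shows how that single quantity feeds the three gradients Luis computes directly in the lecture:

```python
import numpy as np

# Check numerically that (dL/da) * (da/dz) equals a - y, then reuse that
# quantity for the parameter gradients, since z = w1*x1 + w2*x2 + b.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w1, w2, b = 0.5, -0.3, 0.1
x1, x2, y = 2.0, 1.0, 1.0

z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

dL_da = -(y - a) / (a * (1 - a))
da_dz = a * (1 - a)
dL_dz = dL_da * da_dz         # the product from the lab
print(dL_dz, a - y)           # identical up to floating point (≈ -0.31)

# Because dz/dw1 = x1, dz/dw2 = x2, and dz/db = 1:
dL_dw1 = dL_dz * x1           # = (a - y) * x1
dL_dw2 = dL_dz * x2           # = (a - y) * x2
dL_db  = dL_dz                # = (a - y)
```

That reuse is the payoff of the extra \frac{\partial L}{\partial z} step: the scalar a - y is computed once per data point and then multiplied by x_1, x_2, or 1 to get all three parameter gradients.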

I hope this makes things clearer. If you have more questions, just let me know.

I will ask the curriculum engineer who made the lab to mention this divergence, so future learners will not struggle with it.

Thanks,
Lucas


Here’s another thread that has a lot of the same derivations, although it’s based on how this was covered in DLS, so the notation may be a bit different.


Thank you so much! I thought I was losing my mind :smiley: I still have to internalize what you’ve described here. Also, the lesson right after this lab touches on the gradient descent calculation for more complex neural networks; I would have benefited from watching that before working through this lab (W3 Lab 2).