Back Prop question

The pre-activation at layer l+1 is \displaystyle z^{[l+1]}_j = \sum_{r=1}^{n_l} a^{[l]}_r W^{[l+1]}_{r j} + b^{[l+1]}_j.
The activation at layer l is a^{[l]}_k = g(z^{[l]}_k).
We apply the chain rule to each scalar component:

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_k} = \sum_{j=1}^{n_{l+1}} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} \frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k}.

The first term \displaystyle \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} is just the upstream gradient from layer l+1.
For the second term \displaystyle \frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} from the forward pass we have

\frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} = \frac{\partial}{\partial a^{[l]}_k} \left( \sum_{r=1}^{n_l} a^{[l]}_r W^{[l+1]}_{rj} + b^{[l+1]}_j \right) = W^{[l+1]}_{kj}.

For the third term \displaystyle \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k}, by definition a^{[l]}_k = g(z^{[l]}_k) \Rightarrow \displaystyle \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k} = g'(z^{[l]}_k).

Therefore,

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_k} = \sum_{j=1}^{n_{l+1}} \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} W^{[l+1]}_{k j} \right) g'(z^{[l]}_k)

Now note: for fixed k, this is a dot product between the upstream gradient vector \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}} and the k-th row of W^{[l+1]}.

Hence, in vector form:

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} = \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}} W^{[l+1]\top} \right) \circ g'(z^{[l]}).
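To make the shapes concrete, here is a minimal NumPy sketch of this backward step. The layer sizes, the random values, and the choice of sigmoid for g are illustrative assumptions, not anything from the course; gradients and activations are treated as row vectors, matching the convention above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

n_l, n_lp1 = 4, 3
rng = np.random.default_rng(0)

z_l = rng.standard_normal((1, n_l))        # pre-activation at layer l, shape (1, n_l)
W_lp1 = rng.standard_normal((n_l, n_lp1))  # W^{[l+1]}, shape (n_l, n_{l+1})
dz_lp1 = rng.standard_normal((1, n_lp1))   # upstream gradient dL/dz^{[l+1]}, shape (1, n_{l+1})

# dL/dz^{[l]} = (dL/dz^{[l+1]} @ W^{[l+1]}.T) elementwise-times g'(z^{[l]})
dz_l = (dz_lp1 @ W_lp1.T) * sigmoid_prime(z_l)
print(dz_l.shape)  # (1, n_l)
```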

Prof. Ng uses a conceptual and intuitive approach to explain backpropagation. In contrast, the mathematics we’ve been discussing is more formal; it is typically used in advanced ML/DL courses such as CS229 and CS231n. By the way, your questions are part of the homework in those courses.
The reason I chose to compute the gradient of the loss first was to keep the math simple. As a side effect, it also serves as a nice illustration of the gradient accumulation technique.
However, in industry, this is implemented differently. If you’re interested in the details of how automatic differentiation works, I highly recommend watching Andrej Karpathy’s video. He is a former CS231n instructor and does a fantastic job breaking down how modern deep learning frameworks like PyTorch compute gradients under the hood. The video walks through the core ideas behind reverse-mode automatic differentiation and even builds a minimal autograd engine from scratch, which is both educational and inspiring.
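To give a taste of what that looks like, here is a heavily simplified sketch in the spirit of such a minimal autograd engine. The Value class, its methods, and the toy expression are all my own illustration, not code from the video or from any framework.

```python
class Value:
    """A toy scalar that records enough information to run reverse-mode autodiff."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad        # d(out)/d(self) = 1
            other.grad += out.grad       # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(out)/d(self) = other.data
            other.grad += self.data * out.grad   # d(out)/d(other) = self.data
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse order.
        order, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# Example: L = w * x + b, so dL/dw = x, dL/dx = w, dL/db = 1.
w, x, b = Value(2.0), Value(3.0), Value(1.0)
L = w * x + b
L.backward()
print(w.grad, x.grad, b.grad)  # 3.0 2.0 1.0
```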


Can you take me through how you arrive at this term?

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_k} = \sum_{j=1}^{n_{l+1}} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} \frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k}

The loss \mathcal{L}^{(i)} does not depend on z^{[l]}_k directly, but only through the later layers, so we apply the chain rule through the next layer’s pre-activations z^{[l+1]}.
Recall the forward pass

z^{[l+1]}_j = \sum_{r=1}^{n_l} a^{[l]}_r W^{[l+1]}_{rj} + b^{[l+1]}_j.

This means that each activation a^{[l]}_k contributes to every neuron in the next layer through the weight matrix W^{[l+1]}.
Therefore, changing z^{[l]}_k affects a^{[l]}_k = g(z^{[l]}_k), which affects all z^{[l+1]}_j, which in turn affect the loss \mathcal{L}^{(i)}. Thus, we must sum over all j in the next layer. That’s a direct consequence of the chain rule.
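If it helps, here is a small numerical sanity check that summing over every j in layer l+1 matches a finite-difference estimate of \partial \mathcal{L} / \partial z^{[l]}_k. The layer sizes, the tanh activation, and the toy quadratic loss are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_l, n_lp1 = 4, 3
W_lp1 = rng.standard_normal((n_l, n_lp1))
b_lp1 = rng.standard_normal(n_lp1)
z_l = rng.standard_normal(n_l)

def loss_from_z_l(z_l):
    a_l = np.tanh(z_l)                   # example activation g
    z_lp1 = a_l @ W_lp1 + b_lp1          # forward through layer l+1
    return 0.5 * np.sum(z_lp1 ** 2)      # toy loss for illustration

# Analytic gradient via the chain rule: sum over all j in layer l+1.
a_l = np.tanh(z_l)
z_lp1 = a_l @ W_lp1 + b_lp1
dL_dz_lp1 = z_lp1                        # gradient of the toy loss w.r.t. z^{[l+1]}
dL_dz_l = (dL_dz_lp1 @ W_lp1.T) * (1 - np.tanh(z_l) ** 2)

# Finite-difference estimate for one component k.
k, eps = 2, 1e-6
z_plus, z_minus = z_l.copy(), z_l.copy()
z_plus[k] += eps
z_minus[k] -= eps
fd = (loss_from_z_l(z_plus) - loss_from_z_l(z_minus)) / (2 * eps)
print(dL_dz_l[k], fd)                    # should agree to roughly 1e-6
```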

I see, thanks.

That’s what I suspected was going on.

It’s been a while since I had to use calculus but I see what is happening now.


Is there a single document that takes me through all these calculations for back prop from first principles? Obviously Andrew doesn’t present these in his MLS -> Advanced Learning Algorithms course -> Week 3, so far anyway.

If you didn’t like CS229 notes, perhaps you could try CS224n or CS231n notes. I think they are more accessible but less detailed.

Actually, I haven’t had a chance to view CS229, but I have now saved 229, 224n, and 231n on my Mac for later reading.

Thanks.

Here’s a thread with links to lots of websites that cover these issues, including the Stanford courses that Pavel mentioned.

One of the sites linked from that thread is this one, which is worth a look. It was created by Jonas Slalin, who was a mentor for DLS for several years. It starts with a thorough treatment of Forward Propagation and then links to his other pages that cover back prop.

Thanks that’s all very useful background information but I need to spend my time on gaining industry-standard training like MLS.

Fair enough, but note that if your goal is to be able to implement solutions in an “industry standard” way, then you don’t need to know anything about the mathematics behind back propagation. That is because the industry standard way to implement solutions is by using an ML platform like TensorFlow, PyTorch or one of the others. In every case, back propagation is implemented for you by the platform using “auto differentiation” techniques. So all you need is the intuitive understanding of back prop that Prof Ng has carefully designed these courses to give you. Pavel deserves a big thank you for all the effort invested to document the various aspects of the mathematics of back prop here.

Andrew’s intuitive explanation of back prop is misleading for aspiring ML practitioners.

I have thanked @conscell already on more than one occasion for his math presentations.

I believe the MLS course series was designed to be an accessible introduction for a broad audience, including those without a strong background in math or programming. It focuses on helping learners get started with simple, practical projects. As @paulinpaloalto mentioned, the Deep Learning Specialization (DLS) covers the more advanced topics you’re interested in and is officially part of CS230 at Stanford. I recommend considering DLS as a next step after completing MLS.

To make this thread yet another complete backpropagation tutorial, I’m adding the final piece of the derivation.
Let us derive the gradient of \mathcal{L}^{(i)} with respect to a single weight W^{[l]}_{kj}.
The pre-activation at layer l is \displaystyle z^{[l]}_j = \sum_{r=1}^{n_{l-1}} a^{[l-1]}_r W^{[l]}_{rj} + b^{[l]}_j.

Thus, \displaystyle \frac{\partial z^{[l]}_j}{\partial W^{[l]}_{kj}} = a^{[l-1]}_k. Using the chain rule we have

\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}_{kj}} = \sum_{q=1}^{n_l} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_q} \frac{\partial z^{[l]}_q}{\partial W^{[l]}_{kj}}

Since z^{[l]}_q depends only on the weights in column q of W^{[l]}, the derivative \displaystyle \frac{\partial z^{[l]}_q}{\partial W^{[l]}_{kj}} vanishes for every term except q = j:

\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}_{kj}} = \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_j} a^{[l-1]}_k.

So the element-wise form is:

\left[ \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} \right]_{kj} = a^{[l-1]}_k \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_j}

Which corresponds to

\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} = {a^{[l-1]}}^\top \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}}.
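A quick NumPy check of this outer-product form; the shapes and values are made up, and gradients are again treated as row vectors, matching the convention above.

```python
import numpy as np

rng = np.random.default_rng(2)
n_lm1, n_l = 3, 2
a_lm1 = rng.standard_normal((1, n_lm1))   # a^{[l-1]}, shape (1, n_{l-1})
dz_l = rng.standard_normal((1, n_l))      # dL/dz^{[l]}, shape (1, n_l)

dW_l = a_lm1.T @ dz_l                     # shape (n_{l-1}, n_l), same as W^{[l]}
# Matches the element-wise formula [dL/dW^{[l]}]_{kj} = a^{[l-1]}_k * dL/dz^{[l]}_j.
print(np.allclose(dW_l[1, 0], a_lm1[0, 1] * dz_l[0, 0]))  # True
```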

For b^{[l]}_j we have

\frac{\partial z^{[l]}_j}{\partial b^{[l]}_j} = 1, \quad \frac{\partial z^{[l]}_q}{\partial b^{[l]}_j} = 0 \text{ for } q \ne j.

Then

\frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}_j} = \sum_{q=1}^{n_l} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_q} \frac{\partial z^{[l]}_q}{\partial b^{[l]}_j} \quad \Rightarrow \quad \frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}_j} = \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_j}.

So in vector form:

\frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}} = \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}}.
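Putting the three results together, a single layer’s backward step can be sketched as the following function. This is just a summary sketch under the same row-vector convention, with tanh chosen purely as an example for g.

```python
import numpy as np

def layer_backward(dz_next, W_next, z_l, a_prev):
    """Given dL/dz^{[l+1]}, return (dL/dz^{[l]}, dL/dW^{[l]}, dL/db^{[l]})."""
    dz_l = (dz_next @ W_next.T) * (1 - np.tanh(z_l) ** 2)  # g'(z) for g = tanh
    dW_l = a_prev.T @ dz_l                                  # a^{[l-1]T} dL/dz^{[l]}
    db_l = dz_l                                             # dL/db^{[l]} = dL/dz^{[l]}
    return dz_l, dW_l, db_l
```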

That’s great!

Thanks a lot Pavel.

I will consume this over the next few days.
