The pre-activation at layer l+1 is \displaystyle z^{[l+1]}_j = \sum_{r=1}^{n_l} a^{[l]}_r W^{[l+1]}_{r j} + b^{[l+1]}_j.
The activation at layer l is a^{[l]}_k = g(z^{[l]}_k).
We apply the chain rule to each scalar component:
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_k}
= \sum_{j=1}^{n_{l+1}} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} \frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k}.
The first term \displaystyle \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} is just the upstream gradient from layer l+1.
For the second term \displaystyle \frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} from the forward pass we have
\frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} =
\frac{\partial}{\partial a^{[l]}_k} \left( \sum_{r=1}^{n_l} a^{[l]}_r W^{[l+1]}_{r j} + b^{[l+1]}_j \right) = W^{[l+1]}_{k j}.
For the third term \displaystyle \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k}, by definition a^{[l]}_k = g(z^{[l]}_k) \Rightarrow \displaystyle \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k} = g'(z^{[l]}_k).
Therefore,
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_k}
= \sum_{j=1}^{n_{l+1}} \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} W^{[l+1]}_{k j} \right) g'(z^{[l]}_k)
Now note: for fixed k, this is a dot product between the upstream gradient vector \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}} and the k-th row of W^{[l+1]}.
Hence, in vector form:
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} = \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}} W^{[l+1]\top} \right) \circ g'(z^{[l]}).
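To make the vector form concrete, here is a minimal NumPy sketch of this single backward step. The names (dZ_next, W_next, z_l) and the sigmoid choice for g are my own assumptions, not from the derivation above; shapes follow the row-vector convention used here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Assumed shapes (single example, row-vector convention):
#   dZ_next: (1, n_{l+1})   upstream gradient dL/dz^{[l+1]}
#   W_next:  (n_l, n_{l+1}) W^{[l+1]}
#   z_l:     (1, n_l)       pre-activations z^{[l]}
def backward_step(dZ_next, W_next, z_l):
    # dL/dz^{[l]} = (dL/dz^{[l+1]} @ W^{[l+1]}.T) * g'(z^{[l]})
    return (dZ_next @ W_next.T) * sigmoid_prime(z_l)
```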
Prof. Ng uses a conceptual and intuitive approach to explain backpropagation. In contrast, the mathematics we've been discussing is more formal; it is typically covered in advanced ML/DL courses such as CS229 and CS231n. By the way, your questions are part of the homework in those courses.
The reason I chose to compute the gradient of the loss first was to keep the math simple. As a side effect, it also serves as a nice illustration of the gradient accumulation technique.
However, in industry, this is implemented differently. If you're interested in the details of how automatic differentiation works, I highly recommend watching Andrej Karpathy's video. He is a former CS231n instructor and does a fantastic job breaking down how modern deep learning frameworks like PyTorch compute gradients under the hood. The video walks through the core ideas behind reverse-mode automatic differentiation and even builds a minimal autograd engine from scratch, which is both educational and inspiring.
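If you want to see reverse-mode autodiff in action without building an engine from scratch, here is a tiny sketch of my own (not from the video) that uses PyTorch only to cross-check the formula derived above against what loss.backward() computes. The layer sizes and the quadratic "loss" are arbitrary choices for the demo.

```python
import torch

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
n_l, n_next = 3, 2

z_l = torch.randn(1, n_l, requires_grad=True)   # pre-activations z^{[l]}
W_next = torch.randn(n_l, n_next)               # W^{[l+1]}
b_next = torch.randn(1, n_next)                 # b^{[l+1]}

a_l = torch.sigmoid(z_l)                        # a^{[l]} = g(z^{[l]})
z_next = a_l @ W_next + b_next                  # forward pass to layer l+1
loss = (z_next ** 2).sum()                      # arbitrary scalar loss for the demo

loss.backward()                                 # reverse-mode autodiff fills z_l.grad

# Manual gradient from the formula derived above
with torch.no_grad():
    dZ_next = 2 * z_next                        # dL/dz^{[l+1]} for this toy loss
    manual = (dZ_next @ W_next.T) * (a_l * (1 - a_l))

print(torch.allclose(z_l.grad, manual))         # expected: True
```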
Can you take me through how you arrive at this term?
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_k} = \sum_{j=1}^{n_{l+1}} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} \frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k}
The loss \mathcal{L}^{(i)} does not depend on z^{[l]}_k directly, but only through the later layers, so we apply the chain rule through the next layer z^{[l+1]}.
Recall the forward pass
z^{[l+1]}_j = \sum_{r=1}^{n_l} a^{[l]}_r W^{[l+1]}_{rj} + b^{[l+1]}_j.
This means that each activation a^{[l]}_k contributes to every neuron in the next layer through the weight matrix W^{[l+1]}.
Therefore, changing z^{[l]}_k affects a^{[l]}_k = g(z^{[l]}_k), which affects all z^{[l+1]}_j, which in turn affects the loss \mathcal{L}^{(i)}. Thus, we must sum over all j in the next layer. That's a direct consequence of the chain rule.
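As a concrete (made-up) illustration: if layer l has a single neuron and layer l+1 has two neurons, then z^{[l]}_1 reaches the loss along two paths, one through each z^{[l+1]}_j, and the chain rule adds the two contributions:
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_1} = \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_1} W^{[l+1]}_{11} + \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_2} W^{[l+1]}_{12} \right) g'(z^{[l]}_1).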
I see, thanks.
That's what I suspected was going on.
It's been a while since I had to use calculus, but I see what is happening now.
Is there a single document that takes me through all these calculations for back prop from first principles? Obviously, Andrew doesn't present these in his MLS -> Advanced Learning Algorithms course -> Week 3, so far anyway.
If you didn't like the CS229 notes, perhaps you could try the CS224n or CS231n notes. I think they are more accessible but less detailed.
Actually, I haven't had a chance to view CS229, but I have now saved 229, 224n, and 231n on my Mac for later reading.
Thanks.
Here's a thread with links to lots of websites that cover these issues, including the Stanford courses that Pavel mentioned.
One of the sites linked from that thread is this one, which is worth a look. It was created by Jonas Slalin, who was a mentor for DLS for several years. It starts with a thorough treatment of Forward Propagation and then links to his other pages that cover back prop.
Thanks, that's all very useful background information, but I need to spend my time on gaining industry-standard training like MLS.
Fair enough, but note that if your goal is to be able to implement solutions in an "industry standard" way, then you don't need to know anything about the mathematics behind back propagation. That is because the industry standard way to implement solutions is by using an ML platform like TensorFlow, PyTorch or one of the others. In every case, back propagation is implemented for you by the platform using "auto differentiation" techniques. So all you need is the intuitive understanding of back prop that Prof Ng has carefully designed these courses to give you. Pavel deserves a big thank you for all the effort invested to document the various aspects of the mathematics of back prop here.
Andrew's intuitive explanation of back prop is misleading for aspiring ML practitioners.
I have thanked @conscell already on more than one occasion for his math presentations.
I believe the MLS course series was designed to be an accessible introduction for a broad audience, including those without a strong background in math or programming. It focuses on helping learners get started with simple, practical projects. As @paulinpaloalto mentioned, the Deep Learning Specialization (DLS) covers the more advanced topics you're interested in and is officially part of CS230 at Stanford. I recommend considering DLS as a next step after completing MLS.
To turn this thread into yet another complete backpropagation tutorial, I'm adding the final piece of the derivation.
Let us derive the gradient of \mathcal{L}^{(i)} with respect to a single weight W^{[l]}_{kj}.
The pre-activation at layer l is \displaystyle z^{[l]}_j = \sum_{r=1}^{n_{l-1}} a^{[l-1]}_r W^{[l]}_{rj} + b^{[l]}_j.
Thus, \displaystyle \frac{\partial z^{[l]}_j}{\partial W^{[l]}_{kj}} = a^{[l-1]}_k. Using the chain rule we have
\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}_{kj}} = \sum_{q=1}^{n_l} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_q} \frac{\partial z^{[l]}_q}{\partial W^{[l]}_{kj}}
Since z^{[l]}_q depends only on the q-th column of W^{[l]} (the weights W^{[l]}_{rq}), all terms are zero except q = j:
\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}_{kj}} = \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_j} a^{[l-1]}_k.
So the element-wise form is:
\left[ \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} \right]_{kj} = a^{[l-1]}_k \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_j}
This corresponds to the matrix form
\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} = {a^{[l-1]}}^\top \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}}.
For b^{[l]}_j we have
\frac{\partial z^{[l]}_j}{\partial b^{[l]}_j} = 1, \quad \frac{\partial z^{[l]}_q}{\partial b^{[l]}_j} = 0 \text{ for } q \ne j.
Then
\frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}_j} = \sum_{q=1}^{n_l} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_q} \frac{\partial z^{[l]}_q}{\partial b^{[l]}_j} \quad \Rightarrow \quad \frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}_j} = \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_j}.
So in vector form:
\frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}} = \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}}.
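To close the loop, here is a minimal NumPy sketch of these last two results for a single example. The names (param_grads, dZ, a_prev) are my own, and the shapes follow the row-vector convention used throughout this thread.

```python
import numpy as np

# Assumed shapes (single training example, row-vector convention):
#   a_prev: (1, n_{l-1})  activations a^{[l-1]}
#   dZ:     (1, n_l)      dL/dz^{[l]} from the backward step derived earlier
def param_grads(dZ, a_prev):
    dW = a_prev.T @ dZ    # (n_{l-1}, n_l); element [k, j] is a^{[l-1]}_k * dL/dz^{[l]}_j
    db = dZ.copy()        # dL/db^{[l]} = dL/dz^{[l]}
    return dW, db
```

Over a mini-batch these per-example gradients would be summed or averaged; that bookkeeping is left out to keep the sketch aligned with the single-example derivation above.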
That's great!
Thanks a lot Pavel.
I will consume this over the next few days.