The pre-activation at layer l+1 is \displaystyle z^{[l+1]}_j = \sum_{r=1}^{n_l} a^{[l]}_r W^{[l+1]}_{r j} + b^{[l+1]}_j.
The activation at layer l is a^{[l]}_k = g(z^{[l]}_k).
We apply the chain rule to each scalar component:
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_k}
= \sum_{j=1}^{n_{l+1}} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} \frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k}.
The first term \displaystyle \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} is just the upstream gradient from layer l+1.
For the second term \displaystyle \frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} from the forward pass we have
\frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} =
\frac{\partial}{\partial a^{[l]}_k} \left( \sum_{r=1}^{n_l} a^{[l]}_r W^{[l+1]}_{r j} + b^{[l+1]}_j \right) = W^{[l+1]}_{k j}.
For the third term \displaystyle \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k}, by definition a^{[l]}_k = g(z^{[l]}_k) \Rightarrow \displaystyle \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k} = g'(z^{[l]}_k).
Therefore,
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_k}
= \sum_{j=1}^{n_{l+1}} \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} W^{[l+1]}_{k j} \right) g'(z^{[l]}_k)
Now note: for fixed k, this is a dot product between the upstream gradient vector \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}} and the k-th row of W^{[l+1]}.
Hence, in vector form:
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} = \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}} W^{[l+1]\top} \right) \circ g'(z^{[l]}).
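To make the vector form concrete, here is a minimal NumPy sketch of this single backward step. The names (dZ_next, W_next, z_l) and the sigmoid choice for g are my own assumptions, not from the derivation above; shapes follow the row-vector convention used here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Assumed shapes (single example, row-vector convention):
#   dZ_next: (1, n_{l+1})   upstream gradient dL/dz^{[l+1]}
#   W_next:  (n_l, n_{l+1}) W^{[l+1]}
#   z_l:     (1, n_l)       pre-activations z^{[l]}
def backward_step(dZ_next, W_next, z_l):
    # dL/dz^{[l]} = (dL/dz^{[l+1]} @ W^{[l+1]}.T) * g'(z^{[l]})
    return (dZ_next @ W_next.T) * sigmoid_prime(z_l)
```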
Prof. Ng uses a conceptual and intuitive approach to explain backpropagation. In contrast, the mathematics we've been discussing is more formal; it is typically covered in advanced ML/DL courses such as CS229 and CS231n. By the way, your questions are part of the homework in those courses.
The reason I chose to compute the gradient of the loss first was to keep the math simple. As a side effect, it also serves as a nice illustration of the gradient accumulation technique.
However, in industry, this is implemented differently. If you're interested in the details of how automatic differentiation works, I highly recommend watching Andrej Karpathy's video. He is a former CS231n instructor and does a fantastic job breaking down how modern deep learning frameworks like PyTorch compute gradients under the hood. The video walks through the core ideas behind reverse-mode automatic differentiation and even builds a minimal autograd engine from scratch, which is both educational and inspiring.
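If you want to see reverse-mode autodiff in action without building an engine from scratch, here is a tiny sketch of my own (not from the video) that uses PyTorch only to cross-check the formula derived above against what loss.backward() computes. The layer sizes and the quadratic "loss" are arbitrary choices for the demo.

```python
import torch

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
n_l, n_next = 3, 2

z_l = torch.randn(1, n_l, requires_grad=True)   # pre-activations z^{[l]}
W_next = torch.randn(n_l, n_next)               # W^{[l+1]}
b_next = torch.randn(1, n_next)                 # b^{[l+1]}

a_l = torch.sigmoid(z_l)                        # a^{[l]} = g(z^{[l]})
z_next = a_l @ W_next + b_next                  # forward pass to layer l+1
loss = (z_next ** 2).sum()                      # arbitrary scalar loss for the demo

loss.backward()                                 # reverse-mode autodiff fills z_l.grad

# Manual gradient from the formula derived above
with torch.no_grad():
    dZ_next = 2 * z_next                        # dL/dz^{[l+1]} for this toy loss
    manual = (dZ_next @ W_next.T) * (a_l * (1 - a_l))

print(torch.allclose(z_l.grad, manual))         # expected: True
```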
Can you take me through how you arrive at this term?
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_k} = \sum_{j=1}^{n_{l+1}} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_j} \frac{\partial z^{[l+1]}_j}{\partial a^{[l]}_k} \frac{\partial a^{[l]}_k}{\partial z^{[l]}_k}
The loss \mathcal{L}^{(i)} does not depend on z^{[l]}_k directly, but only through the later layers, so we apply the chain rule through the next layer z^{[l+1]}.
Recall the forward pass
z^{[l+1]}_j = \sum_{r=1}^{n_l} a^{[l]}_r W^{[l+1]}_{rj} + b^{[l+1]}_j.
This means that each activation a^{[l]}_k contributes to every neuron in the next layer through the weight matrix W^{[l+1]}.
Therefore, changing z^{[l]}_k affects a^{[l]}_k = g(z^{[l]}_k), which affects all z^{[l+1]}_j, which in turn affects the loss \mathcal{L}^{(i)}. Thus, we must sum over all j in the next layer. That's a direct consequence of the chain rule.
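As a concrete (made-up) illustration: if layer l has a single neuron and layer l+1 has two neurons, then z^{[l]}_1 reaches the loss along two paths, one through each z^{[l+1]}_j, and the chain rule adds the two contributions:
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_1} = \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_1} W^{[l+1]}_{11} + \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l+1]}_2} W^{[l+1]}_{12} \right) g'(z^{[l]}_1).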
I see, thanks.
That's what I suspected was going on.
It's been a while since I had to use calculus, but I see what is happening now.
Is there a single document that takes me through all these calculations for back prop from first principles? Obviously, Andrew doesn't present these in his MLS -> Advanced Learning Algorithms course -> Week 3, so far anyway.
If you didn't like the CS229 notes, perhaps you could try the CS224n or CS231n notes. I think they are more accessible but less detailed.
Actually, I haven't had a chance to view CS229, but I have now saved 229, 224n, and 231n on my Mac for later reading.
Thanks.
Here's a thread with links to lots of websites that cover these issues, including the Stanford courses that Pavel mentioned.
One of the sites linked from that thread is this one, which is worth a look. It was created by Jonas Slalin, who was a mentor for DLS for several years. It starts with a thorough treatment of Forward Propagation and then links to his other pages that cover back prop.
Thanks, that's all very useful background information, but I need to spend my time on gaining industry-standard training like MLS.
Fair enough, but note that if your goal is to be able to implement solutions in an "industry standard" way, then you don't need to know anything about the mathematics behind back propagation. That is because the industry standard way to implement solutions is by using an ML platform like TensorFlow, PyTorch or one of the others. In every case, back propagation is implemented for you by the platform using "auto differentiation" techniques. So all you need is the intuitive understanding of back prop that Prof Ng has carefully designed these courses to give you. Pavel deserves a big thank you for all the effort invested to document the various aspects of the mathematics of back prop here.
Andrew's intuitive explanation of back prop is misleading for aspiring ML practitioners.
I have thanked @conscell already on more than one occasion for his math presentations.
I believe the MLS course series was designed to be an accessible introduction for a broad audience, including those without a strong background in math or programming. It focuses on helping learners get started with simple, practical projects. As @paulinpaloalto mentioned, the Deep Learning Specialization (DLS) covers the more advanced topics you're interested in and is officially part of CS230 at Stanford. I recommend considering DLS as a next step after completing MLS.
To turn this thread into yet another complete backpropagation tutorial, I'm adding the final piece of the derivation.
Let us derive the gradient of \mathcal{L}^{(i)} with respect to a single weight W^{[l]}_{kj}.
The pre-activation at layer l is \displaystyle z^{[l]}_j = \sum_{r=1}^{n_{l-1}} a^{[l-1]}_r W^{[l]}_{rj} + b^{[l]}_j.
Thus, \displaystyle \frac{\partial z^{[l]}_j}{\partial W^{[l]}_{kj}} = a^{[l-1]}_k. Using the chain rule we have
\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}_{kj}} = \sum_{q=1}^{n_l} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_q} \frac{\partial z^{[l]}_q}{\partial W^{[l]}_{kj}}
Since z^{[l]}_q depends only on the q-th column of W^{[l]} (the weights W^{[l]}_{rq}), all terms are zero except q = j:
\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}_{kj}} = \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_j} a^{[l-1]}_k.
So the element-wise form is:
\left[ \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} \right]_{kj} = a^{[l-1]}_k \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_j}
This corresponds to the matrix form
\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} = {a^{[l-1]}}^\top \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}}.
For b^{[l]}_j we have
\frac{\partial z^{[l]}_j}{\partial b^{[l]}_j} = 1, \quad \frac{\partial z^{[l]}_q}{\partial b^{[l]}_j} = 0 \text{ for } q \ne j.
Then
\frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}_j} = \sum_{q=1}^{n_l} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_q} \frac{\partial z^{[l]}_q}{\partial b^{[l]}_j} \quad \Rightarrow \quad \frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}_j} = \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}_j}.
So in vector form:
\frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}} = \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}}.
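To close the loop, here is a minimal NumPy sketch of these last two results for a single example. The names (param_grads, dZ, a_prev) are my own, and the shapes follow the row-vector convention used throughout this thread.

```python
import numpy as np

# Assumed shapes (single training example, row-vector convention):
#   a_prev: (1, n_{l-1})  activations a^{[l-1]}
#   dZ:     (1, n_l)      dL/dz^{[l]} from the backward step derived earlier
def param_grads(dZ, a_prev):
    dW = a_prev.T @ dZ    # (n_{l-1}, n_l); element [k, j] is a^{[l-1]}_k * dL/dz^{[l]}_j
    db = dZ.copy()        # dL/db^{[l]} = dL/dz^{[l]}
    return dW, db
```

Over a mini-batch these per-example gradients would be summed or averaged; that bookkeeping is left out to keep the sketch aligned with the single-example derivation above.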
That's great!
Thanks a lot Pavel.
I will consume this over the next few days.