Does anyone understand why, in the calculation of the derivatives,

da/dz is a(1-a)?

And also, how are dL/dw1, dL/dw2, and dL/db calculated?

Here is the answer. Also, check this YouTube playlist by Eddy Shyu. Maybe @paulinpaloalto will add some more material to read.

Found the answer for dL/dz in the optional material: Derivation of dL/dz

but feel free to comment on the calculation of dL/dw1, dL/dw2, and dL/db.
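As a quick sanity check on the da/dz = a(1-a) claim from the original question, one can compare the analytic derivative of the sigmoid against a central finite difference (a minimal sketch with a made-up value of z, not from the course materials):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.7          # arbitrary test point
a = sigmoid(z)

analytic = a * (1.0 - a)                                      # da/dz = a(1 - a)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # central difference

print(analytic, numeric)  # the two values should agree closely
```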

Here’s another thread with links to material about the derivations of the backprop formulas. It also points to this thread about matrix calculus in general, which is helpful. As you’d expect, matrix calculus is based on the same principles as univariate calculus, but things get more complicated in higher dimensions, beyond even the usual notion of partial derivatives.

Paul sir! Please correct me if I am wrong.

Suppose we have a three-layer model (2 hidden layers and 1 output layer). The chain-rule expressions for dZ1, dW1, and db1 are:

\frac{dL}{dZ1} = \frac{dL}{dA3} \times \frac{dA3}{dZ3}\times \frac{dZ3}{dA2}\times \frac{dA2}{dZ2}\times \frac{dA1}{dZ1}

\frac{dL}{dW1} = \frac{dL}{dA3} \times \frac{dA3}{dZ3}\times \frac{dZ3}{dA2}\times \frac{dA2}{dZ2}\times \frac{dZ2}{dA1}\times \frac{dA1}{dZ1}\times\frac{dZ1}{dW1}

\frac{dL}{db1} = \frac{dL}{dA3} \times \frac{dA3}{dZ3}\times \frac{dZ3}{dA2}\times \frac{dA2}{dZ2}\times \frac{dZ2}{dA1}\times \frac{dA1}{dZ1}\times\frac{dZ1}{db1}

In dW1, we do not take derivative w.r.t. any other weights like W2 or W3, right? Same for b.

Hey @saifkhanengr,

Although not related to your query, I believe that your first equation is missing one term (\frac{dZ2}{dA1}). It should be as follows:

\frac{dL}{dZ1} = \frac{dL}{dA3} \times \frac{dA3}{dZ3}\times \frac{dZ3}{dA2}\times \frac{dA2}{dZ2}\times \frac{dZ2}{dA1} \times \frac{dA1}{dZ1}

\frac{dL}{dW1} = \frac{dL}{dA3} \times \frac{dA3}{dZ3}\times \frac{dZ3}{dA2}\times \frac{dA2}{dZ2}\times \frac{dZ2}{dA1}\times \frac{dA1}{dZ1}\times\frac{dZ1}{dW1}

\frac{dL}{db1} = \frac{dL}{dA3} \times \frac{dA3}{dZ3}\times \frac{dZ3}{dA2}\times \frac{dA2}{dZ2}\times \frac{dZ2}{dA1}\times \frac{dA1}{dZ1}\times\frac{dZ1}{db1}

And as for your query, you are indeed correct. When computing dW_1, we do not take the derivative wrt any other weights or biases, and the same goes for db_1. The reason is simple too.

Consider Z_2 = W_2^T A_1 + b_2. To compute \frac{dL}{dW1}, we need \frac{dZ1}{dW1}, and since A_1 is computed from Z_1, we also need \frac{dA1}{dZ1}. Now, if we took \frac{dZ2}{dW2} instead of \frac{dZ2}{dA1}, the chain of derivatives would not match up, and back-propagation would not work. I hope this resolves your issue.
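The chain above can be sanity-checked numerically with a tiny scalar network (all values made up; sigmoid activations everywhere and a squared loss, purely for illustration, not the course's exact setup):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A tiny scalar 3-layer network with made-up parameters.
x, y = 0.5, 1.0
W1, b1 = 0.3, 0.1
W2, b2 = -0.4, 0.2
W3, b3 = 0.7, -0.1

def forward(W1_):
    Z1 = W1_ * x + b1; A1 = sigmoid(Z1)
    Z2 = W2 * A1 + b2; A2 = sigmoid(Z2)
    Z3 = W3 * A2 + b3; A3 = sigmoid(Z3)
    return Z1, A1, Z2, A2, Z3, A3

Z1, A1, Z2, A2, Z3, A3 = forward(W1)

# Chain rule:
# dL/dW1 = dL/dA3 * dA3/dZ3 * dZ3/dA2 * dA2/dZ2 * dZ2/dA1 * dA1/dZ1 * dZ1/dW1
dL_dA3  = A3 - y             # from L = 0.5 * (A3 - y)^2
dA3_dZ3 = A3 * (1 - A3)      # sigmoid derivative a(1 - a)
dZ3_dA2 = W3
dA2_dZ2 = A2 * (1 - A2)
dZ2_dA1 = W2
dA1_dZ1 = A1 * (1 - A1)
dZ1_dW1 = x
dL_dW1 = (dL_dA3 * dA3_dZ3 * dZ3_dA2 * dA2_dZ2
          * dZ2_dA1 * dA1_dZ1 * dZ1_dW1)

# Finite-difference check of the same derivative:
eps = 1e-6
Lp = 0.5 * (forward(W1 + eps)[-1] - y) ** 2
Lm = 0.5 * (forward(W1 - eps)[-1] - y) ** 2
numeric = (Lp - Lm) / (2 * eps)

print(dL_dW1, numeric)  # the chain-rule product matches the numeric gradient
```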

Cheers,

Elemento

Oh, thanks for catching the missing term, Elemento. And thanks for clarifying my doubts.

Hey, thanks @saifkhanengr, I understand da/dz now,

but is there any explanation for dL/dw1, dL/dw2, and dL/db?

Elemento gave us the correct equations. From those, you can get a sense of how the remaining ones are derived.

And if you want to dig deeper you can follow the links that I gave earlier on this thread.
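For the single-neuron case that started this thread (z = w1*x1 + w2*x2 + b, a = sigmoid(z), cross-entropy loss), the remaining derivatives follow from dL/dz = a - y by one more chain-rule step, since dz/dw1 = x1, dz/dw2 = x2, and dz/db = 1. A minimal numeric sketch (all input values are made up):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up values for a single sigmoid neuron with cross-entropy loss.
x1, x2, y = 1.5, -0.5, 1.0
w1, w2, b = 0.2, -0.3, 0.1

z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# dL/dz = a - y (from the optional "Derivation of dL/dz" material), then:
dL_dw1 = (a - y) * x1   # dz/dw1 = x1
dL_dw2 = (a - y) * x2   # dz/dw2 = x2
dL_db  = (a - y)        # dz/db  = 1

# Finite-difference check on dL/dw1:
def loss(w1_):
    a_ = sigmoid(w1_ * x1 + w2 * x2 + b)
    return -(y * math.log(a_) + (1 - y) * math.log(1 - a_))

eps = 1e-6
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
print(dL_dw1, numeric)  # the two should agree closely
```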

Hi all,

I finally found an article that also explains the derivatives wrt w1, w2, and b:

https://vincentblog.xyz/posts/backpropagation-and-gradient-descent

Feel free to comment.

Alexis

Hi all,

I just noticed that in the linked article https://vincentblog.xyz/posts/backpropagation-and-gradient-descent

the first layer also uses a sigmoid, whereas the course uses tanh for the hidden layer.
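If the hidden layer uses tanh as in the course, the structure of all the chain-rule derivations stays the same; only the activation derivative changes, from a(1 - a) to 1 - a^2. A quick numeric check of that fact (arbitrary test point):

```python
import math

z = 0.4                  # arbitrary test point
a = math.tanh(z)

analytic = 1.0 - a ** 2  # d tanh(z)/dz = 1 - a^2, where a = tanh(z)
eps = 1e-6
numeric = (math.tanh(z + eps) - math.tanh(z - eps)) / (2 * eps)

print(analytic, numeric)  # the two should agree closely
```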