I think dZ^{[1]} = dZ^{[2]} W^{[2]T} * g^{[1]'}(Z^{[1]}).
What’s wrong with my calculations?
Is it the chain rule? The matrix derivative?
Hi @yujin_lee2
Welcome to the community!
The equations in this image apply only to the last layer, for calculating dZ^{[l]}, dW^{[l]}, and db^{[l]}, and the activation function there is softmax. If the activation isn’t softmax, these equations will change; the image only shows the abbreviated form.
To see where dZ^{[l]}, dW^{[l]}, and db^{[l]} originally come from, start with the full chain rule:
dZ^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}}
dW^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial W^{[l]}}
db^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial b^{[l]}}
That’s the chain rule for derivatives; by combining these factors directly, we skip the separate step of calculating dA^{[l]}.
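As a concrete illustration, here is a minimal NumPy sketch of those last-layer gradients, assuming the standard course formulation where softmax combined with cross-entropy simplifies to dZ^{[L]} = A^{[L]} - Y (the shapes and array names here are illustrative, not from the original post):

```python
import numpy as np

# Illustrative shapes: n_L output units, n_prev units in the previous
# layer, m training examples (all values here are made up).
n_L, n_prev, m = 3, 4, 5
A = np.random.rand(n_L, m)         # softmax activations of the last layer
A /= A.sum(axis=0, keepdims=True)  # normalize columns so each sums to 1
Y = np.eye(n_L)[:, np.random.randint(0, n_L, m)]  # one-hot labels
A_prev = np.random.randn(n_prev, m)               # activations of layer L-1

# Softmax + cross-entropy: the chain-rule product collapses to A - Y.
dZ = A - Y                                    # shape (n_L, m)
dW = (1 / m) * dZ @ A_prev.T                  # shape (n_L, n_prev)
db = (1 / m) * dZ.sum(axis=1, keepdims=True)  # shape (n_L, 1)
```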
But if the last layer’s activation isn’t softmax, we should change these equations according to the activation function. For example, if the last layer’s activation function is sigmoid, we first calculate
dA^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} = \left[ -\frac{y^{(1)}}{a^{(1)}} + \frac{1-y^{(1)}}{1-a^{(1)}}, \; \dots, \; -\frac{y^{(m)}}{a^{(m)}} + \frac{1-y^{(m)}}{1-a^{(m)}} \right]
and then calculate dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}} according to dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} \cdot \frac{\partial A^{[L]}}{\partial Z^{[L]}}. For the sigmoid, \frac{\partial A^{[L]}}{\partial Z^{[L]}} = A^{[L]}(1 - A^{[L]}) element-wise, so dZ^{[L]} = dA^{[L]} * A^{[L]}(1 - A^{[L]}).
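If we plug the dA^{[L]} above into that product and simplify (a short worked step added for clarity), the sigmoid/cross-entropy pair also collapses element-wise:

dZ^{[L]} = \left(-\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}}\right) * A^{[L]}(1-A^{[L]}) = -Y(1-A^{[L]}) + (1-Y)A^{[L]} = A^{[L]} - Y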
So concretely, we must have dA^{[l]} for each layer (except layer 0, the input) to get all of dZ^{[l]}, dW^{[l]}, and db^{[l]} by the chain rule.
Also, if the activation function of a layer changes, the equations change accordingly, as in this image.
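For reference (the image isn’t reproduced here, so these standard derivatives are my addition), the \frac{\partial A^{[l]}}{\partial Z^{[l]}} factor for common activations is:

sigmoid: g'(z) = g(z)(1 - g(z))
tanh: g'(z) = 1 - \tanh^2(z)
ReLU: g'(z) = 1 if z > 0, else 0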
Cheers,
Abdelrahman
Well, notice the dimensions of dZ^{[2]} and W^{[2]T}: they are n^{[2]} x m and n^{[1]} x n^{[2]} respectively, right? So the matrix multiply (dot product style) in the formula you show is not going to work. But it does work if the formula is written as Prof Ng and Abdelrahman show it.
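To make the shape argument concrete, here is a tiny NumPy check (the sizes n^{[1]} = 4, n^{[2]} = 3, m = 5 are arbitrary, just for illustration):

```python
import numpy as np

n1, n2, m = 4, 3, 5              # illustrative layer sizes and batch size
dZ2 = np.random.randn(n2, m)     # shape (n^[2], m)
W2 = np.random.randn(n2, n1)     # shape (n^[2], n^[1]); W2.T is (n^[1], n^[2])
gprime = np.random.randn(n1, m)  # stand-in for g^[1]'(Z^[1]), shape (n^[1], m)

# The order in the question: dZ2 @ W2.T is (3, 5) @ (4, 3) -> shape error.
try:
    bad = dZ2 @ W2.T
except ValueError as e:
    print("dZ2 @ W2.T fails:", e)

# The correct order: W2.T @ dZ2 is (4, 3) @ (3, 5) -> (4, 5), matching g'(Z1).
dZ1 = (W2.T @ dZ2) * gprime
print(dZ1.shape)  # (4, 5), i.e. (n^[1], m)
```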
Taking derivatives is a bit more complicated when you’re working with matrices rather than scalars. This is beyond the scope of this course (by design), but you can find links to background information on matrix calculus and on these specific derivations in this thread.
Hello @yujin_lee2,
As @paulinpaloalto suggested, we often resort to checking the matrices’ indices to make sure all symbols are in order. As he also pointed out, we can’t always do matrix calculus the way we do scalar calculus. As Wikipedia summarized very well:
Note that exact equivalents of the scalar product rule and chain rule do not exist when applied to matrix-valued functions of matrices.
We can do a simple exercise to verify that indeed we can’t use chain rule like that. Let’s say we define matrices Z, W, A and a scalar L this way,
Obviously,
Can our scalar “chain rule” recover the same result?
The answer is no. But the following will do:
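(Raymond’s worked derivation was posted as images that aren’t reproduced here. As a stand-in, here is a minimal numerical sketch of the same kind of check, under my own illustrative assumption that A = WZ and L is the sum of A’s entries; it shows \partial L/\partial Z = W^T (\partial L/\partial A), not the scalar-style product multiplied on the wrong side.)

```python
import numpy as np

# Illustrative setup (my own, not Raymond's exact example):
# A = W @ Z and L = sum of all entries of A, so dL/dA is all ones.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2))
Z = rng.standard_normal((2, 2))
dL_dA = np.ones((2, 2))

# Correct matrix-calculus result: dL/dZ = W.T @ (dL/dA).
analytic = W.T @ dL_dA

# Naive "scalar-style" chain rule multiplies on the wrong side:
naive = dL_dA @ W

# Finite-difference check of dL/dZ, entry by entry.
eps = 1e-6
numeric = np.zeros_like(Z)
for i in range(2):
    for j in range(2):
        Zp = Z.copy(); Zp[i, j] += eps
        Zm = Z.copy(); Zm[i, j] -= eps
        numeric[i, j] = ((W @ Zp).sum() - (W @ Zm).sum()) / (2 * eps)

print(np.allclose(numeric, analytic))  # True
print(np.allclose(numeric, naive))     # False (in general)
```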
Lastly, in my step (5), I said “this is wrong” because a matrix-by-matrix derivative like that won’t result in a 2x2 matrix. As the same Wikipedia page says:
Matrix calculus refers to a number of different notations that use matrices and vectors to collect the derivative of each component of the dependent variable with respect to each component of the independent variable.
So the correct object is a tensor that organizes all 16 derivative results (each of the 4 entries of A differentiated with respect to each of the 4 entries of Z).
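To see that shape concretely, here is a small sketch (again under my illustrative A = WZ assumption) that builds the full derivative by finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((2, 2))
Z = rng.standard_normal((2, 2))

# dA/dZ for A = W @ Z: one 2x2 result per entry of Z, i.e. a 2x2x2x2 tensor.
eps = 1e-6
jac = np.zeros((2, 2, 2, 2))  # indices: (a_row, a_col, z_row, z_col)
for i in range(2):
    for j in range(2):
        Zp = Z.copy(); Zp[i, j] += eps
        Zm = Z.copy(); Zm[i, j] -= eps
        jac[:, :, i, j] = (W @ Zp - W @ Zm) / (2 * eps)

print(jac.shape, jac.size)  # (2, 2, 2, 2) -> 16 derivative results
```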
I think the Wikipedia page has a lot of useful examples if you want to dig deeper.
Cheers,
Raymond