I think dZ^{[1]} = dZ^{[2]} W^{[2]T} * g^{[1]'}(Z^{[1]}).
What’s wrong with my calculations?
Is it the chain rule? The matrix derivative?
Hi @yujin_lee2
Welcome to the community!
The equations in this image apply only to the last layer, for calculating dZ^{[l]}, dW^{[l]}, and db^{[l]}, and the activation function there is softmax. If the activation isn’t softmax, these equations will change; the image only shows the abbreviated form.
To see where dZ^{[l]}, dW^{[l]}, and db^{[l]} originally come from, start with the full chain rule:
dZ^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}}
dW^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial W^{[l]}}
db^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial b^{[l]}}
That’s the chain rule for derivatives; by combining these factors directly, we skip the separate step of calculating dA^{[l]}.
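As a concrete illustration, here is a minimal NumPy sketch of those last-layer gradients, assuming the standard course formulation where softmax combined with cross-entropy simplifies to dZ^{[L]} = A^{[L]} - Y (the shapes and array names here are illustrative, not from the original post):

```python
import numpy as np

# Illustrative shapes: n_L output units, n_prev units in the previous
# layer, m training examples (all values here are made up).
n_L, n_prev, m = 3, 4, 5
A = np.random.rand(n_L, m)         # softmax activations of the last layer
A /= A.sum(axis=0, keepdims=True)  # normalize columns so each sums to 1
Y = np.eye(n_L)[:, np.random.randint(0, n_L, m)]  # one-hot labels
A_prev = np.random.randn(n_prev, m)               # activations of layer L-1

# Softmax + cross-entropy: the chain-rule product collapses to A - Y.
dZ = A - Y                                    # shape (n_L, m)
dW = (1 / m) * dZ @ A_prev.T                  # shape (n_L, n_prev)
db = (1 / m) * dZ.sum(axis=1, keepdims=True)  # shape (n_L, 1)
```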
But if the last layer’s activation isn’t softmax, we should change these equations according to the activation function. For example, if the last layer’s activation function is sigmoid, we first calculate
dA^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} = \left[ -\frac{y^{(1)}}{a^{(1)}} + \frac{1-y^{(1)}}{1-a^{(1)}}, \; \dots, \; -\frac{y^{(m)}}{a^{(m)}} + \frac{1-y^{(m)}}{1-a^{(m)}} \right]
and then calculate dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}} according to dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} \cdot \frac{\partial A^{[L]}}{\partial Z^{[L]}}. For the sigmoid, \frac{\partial A^{[L]}}{\partial Z^{[L]}} = A^{[L]}(1 - A^{[L]}) element-wise, so dZ^{[L]} = dA^{[L]} * A^{[L]}(1 - A^{[L]}).
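If we plug the dA^{[L]} above into that product and simplify (a short worked step added for clarity), the sigmoid/cross-entropy pair also collapses element-wise:

dZ^{[L]} = \left(-\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}}\right) * A^{[L]}(1-A^{[L]}) = -Y(1-A^{[L]}) + (1-Y)A^{[L]} = A^{[L]} - Y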
So concretely, we must have dA^{[l]} for each layer (except layer 0, the input) to get all of dZ^{[l]}, dW^{[l]}, and db^{[l]} by the chain rule.
Also, if the activation function of a layer changes, the equations change accordingly, as in this image.
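For reference (the image isn’t reproduced here, so these standard derivatives are my addition), the \frac{\partial A^{[l]}}{\partial Z^{[l]}} factor for common activations is:

sigmoid: g'(z) = g(z)(1 - g(z))
tanh: g'(z) = 1 - \tanh^2(z)
ReLU: g'(z) = 1 if z > 0, else 0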
Cheers,
Abdelrahman
Well, notice the dimensions of dZ^{[2]} and W^{[2]T}: they are n^{[2]} x m and n^{[1]} x n^{[2]} respectively, right? So the matrix multiply (dot product style) in the formula you show is not going to work. But it does work if the formula is written as Prof Ng and Abdelrahman show it.
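To make the shape argument concrete, here is a tiny NumPy check (the sizes n^{[1]} = 4, n^{[2]} = 3, m = 5 are arbitrary, just for illustration):

```python
import numpy as np

n1, n2, m = 4, 3, 5              # illustrative layer sizes and batch size
dZ2 = np.random.randn(n2, m)     # shape (n^[2], m)
W2 = np.random.randn(n2, n1)     # shape (n^[2], n^[1]); W2.T is (n^[1], n^[2])
gprime = np.random.randn(n1, m)  # stand-in for g^[1]'(Z^[1]), shape (n^[1], m)

# The order in the question: dZ2 @ W2.T is (3, 5) @ (4, 3) -> shape error.
try:
    bad = dZ2 @ W2.T
except ValueError as e:
    print("dZ2 @ W2.T fails:", e)

# The correct order: W2.T @ dZ2 is (4, 3) @ (3, 5) -> (4, 5), matching g'(Z1).
dZ1 = (W2.T @ dZ2) * gprime
print(dZ1.shape)  # (4, 5), i.e. (n^[1], m)
```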
Taking derivatives is a bit more complicated when you’re working with matrices rather than scalars. This is beyond the scope of this course (by design), but you can find links to background information on matrix calculus and on these specific derivations in this thread.
Hello @yujin_lee2,
As @paulinpaloalto suggested, we often resort to checking the matrices’ indices to make sure all symbols are in order. As he also pointed out, we can’t always do matrix calculus the way we do scalar calculus. As Wikipedia summarized very well:
Note that exact equivalents of the scalar product rule and chain rule do not exist when applied to matrix-valued functions of matrices.
We can do a simple exercise to verify that indeed we can’t use chain rule like that. Let’s say we define matrices Z, W, A and a scalar L this way,
Obviously,
Can our scalar “chain rule” recover the same result?
The answer is no. But the following will do:
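(Raymond’s worked derivation was posted as images that aren’t reproduced here. As a stand-in, here is a minimal numerical sketch of the same kind of check, under my own illustrative assumption that A = WZ and L is the sum of A’s entries; it shows \partial L/\partial Z = W^T (\partial L/\partial A), not the scalar-style product multiplied on the wrong side.)

```python
import numpy as np

# Illustrative setup (my own, not Raymond's exact example):
# A = W @ Z and L = sum of all entries of A, so dL/dA is all ones.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2))
Z = rng.standard_normal((2, 2))
dL_dA = np.ones((2, 2))

# Correct matrix-calculus result: dL/dZ = W.T @ (dL/dA).
analytic = W.T @ dL_dA

# Naive "scalar-style" chain rule multiplies on the wrong side:
naive = dL_dA @ W

# Finite-difference check of dL/dZ, entry by entry.
eps = 1e-6
numeric = np.zeros_like(Z)
for i in range(2):
    for j in range(2):
        Zp = Z.copy(); Zp[i, j] += eps
        Zm = Z.copy(); Zm[i, j] -= eps
        numeric[i, j] = ((W @ Zp).sum() - (W @ Zm).sum()) / (2 * eps)

print(np.allclose(numeric, analytic))  # True
print(np.allclose(numeric, naive))     # False (in general)
```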
Lastly, in my step (5), I said “this is wrong” because a matrix-by-matrix derivative like that won’t result in a 2x2 matrix. As the same Wikipedia page says:
Matrix calculus refers to a number of different notations that use matrices and vectors to collect the derivative of each component of the dependent variable with respect to each component of the independent variable.
So the correct object is a tensor that organizes all 16 derivative results (each of the 4 entries of A differentiated with respect to each of the 4 entries of Z).
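To see that shape concretely, here is a small sketch (again under my illustrative A = WZ assumption) that builds the full derivative by finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((2, 2))
Z = rng.standard_normal((2, 2))

# dA/dZ for A = W @ Z: one 2x2 result per entry of Z, i.e. a 2x2x2x2 tensor.
eps = 1e-6
jac = np.zeros((2, 2, 2, 2))  # indices: (a_row, a_col, z_row, z_col)
for i in range(2):
    for j in range(2):
        Zp = Z.copy(); Zp[i, j] += eps
        Zm = Z.copy(); Zm[i, j] -= eps
        jac[:, :, i, j] = (W @ Zp - W @ Zm) / (2 * eps)

print(jac.shape, jac.size)  # (2, 2, 2, 2) -> 16 derivative results
```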
I think the Wikipedia page has a lot of useful examples if you want to dig deeper.
Cheers,
Raymond