Week 3, "Gradient Descent for Neural Networks"

Hello @Juheon_Chu,

The idea is just the chain rule.

According to the left-hand side of the slide, you know how the cost depends on Z^[1]:

[slide image]
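In case you are reading this without the slide open, the dependency chain is, from memory (the standard two-layer network of this week, with $g^{[1]}$ the hidden activation and $\sigma$ the output sigmoid):

$$Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = g^{[1]}(Z^{[1]}), \quad Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma(Z^{[2]}), \quad \mathcal{J} = \mathcal{J}(A^{[2]}, Y),$$

so the cost depends on $Z^{[1]}$ only through $A^{[1]}$ and then $Z^{[2]}$.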

Then we can name the relevant derivatives:
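In scalar-style notation, they are

$$\frac{\partial \mathcal{J}}{\partial Z^{[2]}}, \qquad \frac{\partial Z^{[2]}}{\partial A^{[1]}}, \qquad \frac{\partial A^{[1]}}{\partial Z^{[1]}}.$$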

However, we can't just multiply them together, because the chain rule for matrices is not like the chain rule for scalars that we learned in high school. That does not stop us from finding out what each of those derivatives is:
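Concretely, with the forward equations above: $\frac{\partial Z^{[2]}}{\partial A^{[1]}}$ involves only $W^{[2]}$, because $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$ is linear in $A^{[1]}$; $\frac{\partial A^{[1]}}{\partial Z^{[1]}} = g^{[1]\prime}(Z^{[1]})$, taken element-wise, because $g^{[1]}$ acts element-wise; and $\frac{\partial \mathcal{J}}{\partial Z^{[2]}}$ is just the $dZ^{[2]}$ you already computed in the previous backprop step.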

So we now have all of the terms needed for that final formula.

The final formula in the slide tells us

  • the correct order of those terms,
  • the need for transposing W^[2], and
  • the element-wise multiplication operator

as a result of the chain rule involving matrices (see the formula reproduced below).
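For reference, if I am remembering the slide correctly, that final formula is

$$dZ^{[1]} = W^{[2]T} \, dZ^{[2]} \ast g^{[1]\prime}(Z^{[1]}),$$

with $\ast$ denoting element-wise multiplication.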

If you have time, go through this post for an example of how the matrix-based chain rule differs from the usual scalar-based chain rule. In fact, you will see why W has to be transposed and switches positions with dZ. You can also use the same idea to prove that final formula, but it is going to take some time ;).
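If you would rather convince yourself numerically before attempting the proof, here is a minimal NumPy sketch (my own, not from the course notebooks; it assumes a tanh hidden layer, a sigmoid output, and the logistic loss) that checks the formula against finite differences:

```python
import numpy as np

# Numerical check of dZ1 = W2.T @ dZ2 * g1'(Z1) on a tiny two-layer
# network (tanh hidden layer, sigmoid output, logistic loss).
# Shapes follow the course convention: one column per example.
rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5                        # input size, hidden size, examples
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W1 = rng.standard_normal((n_h, n_x)); b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h));   b2 = np.zeros((1, 1))

def cost_from_Z1(Z1):
    """Cost J as a function of Z1, holding W2, b2, and Y fixed."""
    A1 = np.tanh(Z1)                                  # A1 = g1(Z1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))        # sigmoid output
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

# Forward pass
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

# Backprop. The slides write dZ2 = A2 - Y and fold the 1/m into dW;
# dividing by m here makes dZ1 equal the true dJ/dZ1 entry by entry.
dZ2 = (A2 - Y) / m
dZ1 = W2.T @ dZ2 * (1 - A1 ** 2)                      # 1 - tanh^2 = g1'(Z1)

# Central finite differences, one entry of Z1 at a time
eps = 1e-6
dZ1_num = np.zeros_like(Z1)
for i in range(n_h):
    for j in range(m):
        Zp, Zm = Z1.copy(), Z1.copy()
        Zp[i, j] += eps
        Zm[i, j] -= eps
        dZ1_num[i, j] = (cost_from_Z1(Zp) - cost_from_Z1(Zm)) / (2 * eps)

print(np.max(np.abs(dZ1 - dZ1_num)))                  # tiny (~1e-10): they match
```

The printed number should be around $10^{-9}$ or smaller. If you drop the transpose or swap the order of the factors, NumPy will typically raise a shape error, which is a quick way to see that the matrix chain rule really does dictate those details.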

Cheers,
Raymond

PS: You can add a backslash between ^ and [ when you type Z^[1] so that it can be displayed correctly. I corrected your post for you.
