Week 3, "Gradient Descent for Neural Networks"

Hello, I have a question about the “Formulas for Computing Derivatives” introduced around 7:11. There, Dr. Ng states that dz^[1] = W^[2]T dz^[2] * g^[1]'(z^[1]). Can anyone explain where this comes from?

Best regards,

Hello @Juheon_Chu,

The idea is just the chain rule.

According to the left, you know how the cost depends on Z^[1], and then we can name the relevant derivatives:

However, we can’t just multiply them together, because the chain rule for matrices is not like the chain rule for scalars that we learnt in high school. That does not stop us from finding out what each of those derivatives is:

So we basically get all of those terms needed for that final formula.

The final formula in the slide tells us

  • the correct order of those terms,
  • the need for transposing W^[2], and
  • the element-wise multiplication operator

as a result of chain-rule involving matrices.
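One way to convince yourself that the formula is at least numerically right is a finite-difference check. The sketch below is only an illustration under assumptions that are not taken from the slide: a tanh hidden layer, a sigmoid output with cross-entropy loss, made-up layer sizes, and random data. Note that the course's dZ terms are derivatives of the summed loss (the 1/m factor only enters later, in dW), so the check perturbs the summed loss rather than the averaged cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 inputs, 4 hidden units, 1 output, 5 examples.
n_x, n_h, n_y, m = 3, 4, 1, 5
X = rng.normal(size=(n_x, m))
Y = rng.integers(0, 2, size=(n_y, m)).astype(float)
W1 = rng.normal(size=(n_h, n_x))
b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(n_y, n_h))
b2 = np.zeros((n_y, 1))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def summed_loss_from_Z1(Z1):
    """Cross-entropy loss summed over all examples, as a function of Z1 only."""
    A1 = np.tanh(Z1)                      # assumed hidden activation g^[1]
    A2 = sigmoid(W2 @ A1 + b2)            # assumed output activation
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))


# Forward pass.
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
A2 = sigmoid(W2 @ A1 + b2)

# Formulas in the lecture's notation.
dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)        # g^[1]'(Z1) = 1 - tanh(Z1)^2

# Central finite differences, one entry of Z1 at a time.
eps = 1e-6
numeric = np.zeros_like(Z1)
for i in range(Z1.shape[0]):
    for j in range(Z1.shape[1]):
        Z_plus = Z1.copy()
        Z_plus[i, j] += eps
        Z_minus = Z1.copy()
        Z_minus[i, j] -= eps
        numeric[i, j] = (
            summed_loss_from_Z1(Z_plus) - summed_loss_from_Z1(Z_minus)
        ) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - dZ1))
print("largest difference between formula and finite differences:", max_abs_diff)
```

A tiny difference here is evidence that the order, the transpose, and the element-wise product are right for this setup; it is a sanity check, not a proof.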

If you have time, go through this post for an example of how the matrix-based chain rule differs from the usual scalar-based chain rule. In fact, you will see why W has to be transposed and switch positions with dZ. You can also use the same idea to prove that final formula, but it is going to take some time ;).

Cheers,
Raymond

PS: You can add a backslash between ^ and [ when you type Z^[1] so that it can be displayed correctly. I corrected your post for you.

Hello @Juheon_Chu,

Sometimes people like to check the correctness of a formula by looking at the shapes. For example, we can assign some shapes to the weights and to X:

Then we copy these shapes into the formula and see whether the maths works out:

So, for example, if we didn’t transpose W, the matrix multiplication wouldn’t work. Of course, this is not a proof, and as I suggested, the proof will take some time.
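The shape check can also be run directly in NumPy. This is a minimal sketch with made-up sizes (the variable names and dimensions are assumptions, not the ones from the original post), and it assumes a tanh hidden layer purely so that g' has something to compute.

```python
import numpy as np

# Hypothetical sizes: 3 inputs, 4 hidden units, 1 output, 5 examples.
n_x, n_h, n_y, m = 3, 4, 1, 5

W2 = np.ones((n_y, n_h))      # W^[2] has shape (n_y, n_h)
Z1 = np.ones((n_h, m))        # Z^[1] has shape (n_h, m)
dZ2 = np.ones((n_y, m))       # dZ^[2] has the same shape as Z^[2]

g_prime = 1 - np.tanh(Z1) ** 2   # same shape as Z^[1]

# (n_h, n_y) @ (n_y, m) -> (n_h, m), then element-wise with (n_h, m).
dZ1 = (W2.T @ dZ2) * g_prime
print(dZ1.shape)  # (4, 5)

# Without the transpose the inner dimensions do not match.
try:
    W2 @ dZ2
    untransposed_ok = True
except ValueError as err:
    untransposed_ok = False
    print("Without the transpose:", err)
```

The result has the same shape as Z^[1], which is what we expect from a derivative with respect to Z^[1].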

Cheers,
Raymond

Hello @rmwkwok, thanks for your explanation. I tried to follow this guideline and also went through the link you provided, but unfortunately it is really hard to follow with my limited understanding. However, I do follow the point you made that dJ/dz[1] relates to those four different partial derivatives. Is there any way to understand “matrix calculus” as it applies specifically to this context? I am sorry to ask again after all the effort you put into these explanations.


Hello @Juheon_Chu,

The purpose of sharing this toy example is to give you a reference for making your own example in the context you are concerned with.

The key idea of my example is to establish the context with

  • the equations for L, Z, W and A (step 1 & 2)
  • some small matrices W and A for easy calculation (step 1)

Step 3 essentially expands every matrix multiplication until we end up with an equation that has only scalars on both sides. This is important because, with only scalars, you can use scalar calculus.

Then, without loss of generality, you can examine one of the derivatives of interest (step 4), and figure out a form (step 6) that gives a result consistent with step 4.
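To illustrate the "break it down to scalars" idea on the lecture's formula itself, here is a sketch for a single training example. It is not the toy example from the linked post; the index names j and k and the loss symbol L are my own labelling, and dz is used in the lecture's sense of "derivative of the loss with respect to z".

```latex
\begin{align*}
z^{[2]}_k &= \sum_j W^{[2]}_{kj}\, a^{[1]}_j + b^{[2]}_k,
\qquad a^{[1]}_j = g^{[1]}\!\left(z^{[1]}_j\right) \\
\frac{\partial L}{\partial z^{[1]}_j}
  &= \sum_k \frac{\partial L}{\partial z^{[2]}_k}\,
     \frac{\partial z^{[2]}_k}{\partial a^{[1]}_j}\,
     \frac{\partial a^{[1]}_j}{\partial z^{[1]}_j} \\
  &= \sum_k dz^{[2]}_k\, W^{[2]}_{kj}\; g^{[1]\prime}\!\left(z^{[1]}_j\right) \\
  &= \left(W^{[2]T} dz^{[2]}\right)_j \, g^{[1]\prime}\!\left(z^{[1]}_j\right)
\end{align*}
```

The sum runs over k, the first index of W^[2], which is why the transpose appears and why W^[2]T sits to the left of dz^[2]. The factor g' depends only on j, which is why the last product is element-wise.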

If you google “matrix calculus”, you will see tutorial PDFs from colleges like Imperial College London and the University of Minnesota. They prove some rules for derivatives involving matrices, but behind the proofs, one core idea is to break things down to scalars like I did in step 3.

You might not be able to find the exact rule you are looking for in those PDFs, but you can modify my toy example to create one of your own.

This is a DIY process, @Juheon_Chu :wink: If you can’t find the exact rule you need by googling, I think it’s time for you to work your toy example out on a piece of paper. Write everything down clearly, do the calculations step by step, and it will bring you to a good result! That’s why we learn maths! :wink:

Feel free to share your work the way I did in those steps, so we can discuss it. Remember to use small and simple matrices, ideally with no dimension larger than 2, so that we won’t get flooded by too many numbers and symbols. :wink:

Cheers,
Raymond

Thank you! I referred to a PDF file that you mentioned, and here is the specific part that I am quoting from (D.25, D Matrix Calculus, Imperial College London).

Using this reference, I can change some variables so that they correspond to the equation in D.25. Then, we can let a[1] be a function of z[2], which is a function of a[3], which is a function of z[4]. Then, we can simply apply the chain rule as mentioned. I wrote my work below.

May I ask whether this approach is on the right track?

Best regards,
JC



Hi @Juheon_Chu,

I told you this last time, but maybe I didn’t make it clear enough. Your post looks like this:

and you see those “image”?

You can fix them by adding a backslash between every pair of ^ and [. For example, you type Z^\[1] and it will display Z^[1] correctly. You type Z^[1], and it will display this “image”.

@Juheon_Chu

That expression requires w, x, y, z to be vectors, but you are replacing them with matrices…

@Juheon_Chu, I can see that you were trying to find a formula to use, but your last attempt has two problems that you can avoid in future attempts:

  1. substituting matrices where the rule requires vectors
  2. substituting two different matrices for z when z should stand for one and the same vector. The same problem happens with w, x, and y as well.

@Juheon_Chu, I do not want you to spend too much time looking for that formula. There are a few reasons:

  1. I could not find it in that PDF.
  2. You do not need to know how to derive this formula to move on to the rest of this specialization, and you won’t need that to build models with TensorFlow.
  3. If you can accept a rule that you find in that PDF or another tutorial, why can’t you just accept the formula from the lecture? I mean, if there were such a rule, how different would it be from the formula?

@Juheon_Chu, if you insist, the only suggestion I have for you is to follow my example and build your own. It will require you to know matrix multiplication and calculus. It will require you to carefully write down the problem you want to solve. It will require you to carefully work out the steps. It took me fewer than 10 steps in my example, as you can see, and I guess your case would be similar to mine. Maybe it will be just 20-30 minutes of work.

@Juheon_Chu, if you start your example, we can consider that as an exercise, and we can look at it together :wink: if you share a photo of your handwritten work (I don’t recommend typing symbols here for maths steps).

However, the choice is yours. If I were you, I would either follow my example OR, if I were not comfortable with doing maths, accept the lecture’s formula. The former might take 30 minutes, and the latter takes no time. Either way, you can move on.

Cheers,
Raymond

Hello @Juheon_Chu,

Besides the above three reasons, let me give you one more.

In our high school calculus, we have only one chain rule, right? So, it is very reasonable for us to remember this rule and apply it when we need it.

I suppose you were trying to do the same in matrix calculus, right?

Now, you have read that PDF, and there are many, many rules. It is not one rule, but many rules. In fact, if you open the Wikipedia page for matrix calculus, you will find maybe more than 100 identities (I didn’t count)?

And even with so many identities already, the page doesn’t list any matrix-by-matrix one, which is what you were looking for.

Can we really remember all those formulas? Or maybe we can just accept the lecture’s formula? Or maybe we can spend 30 minutes working out our own by following my example?

Cheers,
Raymond

Thank you very much. I will accept it for now and try this again when my proficiency increases. Thank you so much for your effort to explain this to me!