Week 3, "Gradient Descent for Neural Networks"

Juheon_Chu · March 21, 2024, 12:11pm

Hello, I have a question regarding the “Formulas for Computing Derivatives” introduced around 7:11. Here, Dr. Ng mentioned that dz^[1] = W^[2]T dz^[2] * g^[1]'(z^[1]). Can anyone explain where this came from?

Best regards,

rmwkwok · March 21, 2024, 12:43pm

Hello @Juheon_Chu,

The idea is just chain-rule.

From the left-hand side of the slide, you know how the cost depends on Z^[1],

so we can name the relevant derivatives:

However, we can’t just multiply them together, because the chain rule for matrices is not like the scalar chain rule that we learnt in high school. That does not stop us from finding out what each of those derivatives is:

So we basically get all of those terms needed for that final formula.

The final formula in the slide tells us

the correct order of those terms,
the need for transposing W^[2], and
the element-wise multiplication operator

as a result of chain-rule involving matrices.
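As a sketch of where those three features come from, here is the scalar version of the chain rule for a single training example. It assumes the usual two-layer setup from the lecture, z^[2] = W^[2] a^[1] + b^[2] and a^[1] = g^[1](z^[1]) applied element-wise, and it writes dz for the derivative of the loss L with respect to z. The indices i and j are introduced here only for illustration.

```latex
\begin{aligned}
\frac{\partial L}{\partial z^{[1]}_i}
  &= \sum_j \frac{\partial L}{\partial z^{[2]}_j}
     \,\frac{\partial z^{[2]}_j}{\partial a^{[1]}_i}
     \,\frac{\partial a^{[1]}_i}{\partial z^{[1]}_i} \\
  &= \sum_j dz^{[2]}_j \, W^{[2]}_{ji} \, g^{[1]\prime}\!\left(z^{[1]}_i\right) \\
  &= \left(W^{[2]T} dz^{[2]}\right)_i \, g^{[1]\prime}\!\left(z^{[1]}_i\right)
\end{aligned}
```

The sum over j with W^[2]_{ji} is exactly a multiplication by the transpose of W^[2], and because a^[1]_i depends only on z^[1]_i, the last factor multiplies element by element rather than as a matrix product.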

If you have time, go through this post for an example of how matrix-based chain-rule is different from the usual scalar-based chain-rule. In fact, you will see why W has to be transposed and switch position with dZ. You can also use the same idea to prove that final formula but it is going to take some time ;).

Cheers,
Raymond

PS: You can add a backslash between ^ and [ when you type Z^[1] so that it can be displayed correctly. I corrected your post for you.

rmwkwok · March 21, 2024, 1:10pm

Hello @Juheon_Chu,

Sometimes people like to inspect the correctness of the formula by the shapes. For example, we can assign some shapes to the weights and the X:

And we copy these shapes into the formula and see whether the maths works out:

So, for example, if we didn’t transpose W, the matrix multiplication wouldn’t work. Of course, this is not a proof, and as I suggested, the proof will take some time.
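The shape check described above can be sketched in NumPy. The layer sizes below are hypothetical (they are not the ones from the original screenshots), and tanh is only an assumed choice for g^[1]. As noted, passing this check is not a proof; it only shows that the formula is dimensionally consistent.

```python
import numpy as np

# Hypothetical sizes: 4 hidden units, 1 output unit, 5 training examples.
n_h, n_y, m = 4, 1, 5

rng = np.random.default_rng(0)
W2 = rng.standard_normal((n_y, n_h))   # shape (1, 4)
Z1 = rng.standard_normal((n_h, m))     # shape (4, 5)
dZ2 = rng.standard_normal((n_y, m))    # shape (1, 5)

def tanh_prime(z):
    """Derivative of tanh, used here as an assumed activation g^[1]."""
    return 1.0 - np.tanh(z) ** 2

# (4, 1) @ (1, 5) -> (4, 5), then element-wise product with a (4, 5) array.
dZ1 = (W2.T @ dZ2) * tanh_prime(Z1)
print(dZ1.shape == Z1.shape)

def untransposed_fails():
    """Return True if W2 @ dZ2 (no transpose) raises a shape error."""
    try:
        W2 @ dZ2   # (1, 4) @ (1, 5): inner dimensions 4 and 1 do not match
    except ValueError:
        return True
    return False

print(untransposed_fails())
```

The result dZ1 has the same shape as Z1, which is what a derivative with respect to Z1 should have.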

Cheers,
Raymond

Juheon_Chu · March 21, 2024, 2:43pm

Hello @rmwkwok, thanks for your explanation. I tried to follow this guideline and also read the post you linked, but unfortunately it is really hard to follow at my current level of understanding. However, I agree with your point that dJ/dz^[1] relates to those four different partial derivatives. Is there any way to understand “matrix calculus” as applied specifically to this context? I am terribly sorry to ask this, even after your massive effort to craft this series of insightful instructions.

rmwkwok · March 21, 2024, 11:09pm

Hello @Juheon_Chu,

The purpose of sharing this toy example is to give you a reference for making your own example for the context you are concerned about.

The key idea of my example is to establish the context with

the equations for L, Z, W and A (step 1 & 2)
some small matrices W and A for easy calculation (step 1)

Step 3 essentially expands every matrix multiplication so that we end up with equations that have only scalars on both sides. This is important because, with only scalars, you can use scalar calculus.

Then, without loss of generality, you can examine one of the derivatives of interest (step 4) and figure out a form (step 6) that gives a result consistent with step 4.

If you google “matrix calculus”, you will see tutorial PDFs from universities like Imperial College London and the University of Minnesota. They prove some rules for derivatives with matrices, but behind the proofs, one core idea is to break things down into scalars like I did in step 3.

You might not be able to find the exact rule you are looking for in those PDFs, but you can modify my toy example to create one of your own.

This is a DIY process, @Juheon_Chu. If you can’t find the exact rule you need on Google, I think it’s time for you to work out your toy example on a piece of paper. Write everything down clearly, do the calculations step by step, and it will bring you to a good result! That’s why we learn maths!

Feel free to share your work the way I did in those steps, so we can discuss it. Remember to use small and simple matrices, ideally with no dimension larger than 2, so that we won’t get flooded by too many numbers and symbols.
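Alongside the pen-and-paper work, a small numerical check can confirm that the lecture’s formula agrees with plain scalar derivatives. The sketch below is only an illustration under stated assumptions: tanh for g^[1], a sigmoid output with cross-entropy loss summed over examples (so that dZ^[2] = A^[2] - Y with no 1/m factor), and 2-by-2 sizes. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_loss(Z1, W2, b2, Y):
    """Cross-entropy loss summed over examples, viewed as a function of Z1."""
    A1 = np.tanh(Z1)
    A2 = sigmoid(W2 @ A1 + b2)
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def analytic_dZ1(Z1, W2, b2, Y):
    """The lecture's formula: dZ1 = W2^T dZ2 * g'(Z1), with g = tanh."""
    A1 = np.tanh(Z1)
    A2 = sigmoid(W2 @ A1 + b2)
    dZ2 = A2 - Y
    return (W2.T @ dZ2) * (1.0 - A1 ** 2)

def numeric_dZ1(Z1, W2, b2, Y, eps=1e-6):
    """Scalar central differences: nudge one entry of Z1 at a time."""
    grad = np.zeros_like(Z1)
    for idx in np.ndindex(*Z1.shape):
        Z_plus, Z_minus = Z1.copy(), Z1.copy()
        Z_plus[idx] += eps
        Z_minus[idx] -= eps
        diff = total_loss(Z_plus, W2, b2, Y) - total_loss(Z_minus, W2, b2, Y)
        grad[idx] = diff / (2 * eps)
    return grad

rng = np.random.default_rng(0)
Z1 = rng.standard_normal((2, 2))   # 2 hidden units, 2 examples
W2 = rng.standard_normal((1, 2))   # 1 output unit
b2 = rng.standard_normal((1, 1))
Y = np.array([[1.0, 0.0]])

formula = analytic_dZ1(Z1, W2, b2, Y)
numeric = numeric_dZ1(Z1, W2, b2, Y)
print(np.allclose(formula, numeric, atol=1e-6))
```

Like the shape check, this is evidence rather than a proof, but each entry of the numerical gradient is an ordinary scalar derivative, which is the same idea as breaking the matrices down into scalars.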

Cheers,
Raymond

Juheon_Chu · March 22, 2024, 8:26am

Thank you! I referred to the PDF that you mentioned, and here is the specific part that I am quoting from (D.25, D Matrix Calculus, Imperial College London).

In this reference, I can change some variables to correspond to the equation in D.25. Then we can let a^[1] be a function of z^[2], which is a function of a^[3], which is a function of z^[4]. Then we can simply apply the chain rule as mentioned. I wrote my work below.

May I ask whether this approach is on the right track?

Best regards,
JC


rmwkwok · March 22, 2024, 12:14pm

Hi @Juheon_Chu,

I told you this last time, but maybe I didn’t make it clear enough. Your post looks like this:

and you see those “”?

You can fix them by adding a backslash between every pair of ^ and [. For example, you type Z^\[1] and it will display Z^[1] correctly. You type Z^[1], and it will display this “”.

rmwkwok · March 22, 2024, 12:26pm

@Juheon_Chu

That expression requires w, x, y, z to be vectors, but you are replacing them with matrices…

rmwkwok · March 22, 2024, 12:58pm

@Juheon_Chu, I can see that you were trying to find a formula to use, but your last attempt has two problems that you can avoid in the future:

You substituted matrices where the rule requires vectors.
You substituted two different matrices for z when z should stand for one and the same vector. The same problem applies to w, x, and y as well.

@Juheon_Chu, I do not want you to spend too much time looking for that formula. There are a few reasons:

I could not find it in that PDF.
You do not need to know how to derive this formula to move on to the rest of this specialization, and you won’t need it to build models with TensorFlow.
If you can accept a rule that you find in that PDF or another tutorial, why can’t you just accept the formula from the lecture? I mean, if there were such a rule, how different would it be from the formula?

@Juheon_Chu, if you insist, the only suggestion I have for you is to follow my example and build your own. It will require you to know matrix multiplication and calculus, to carefully write down the problem you want to solve, and to carefully work out the steps. It took me fewer than 10 steps in my example, as you can see, and I guess your case would be similar to mine. Maybe it will be just 20-30 minutes of work.

@Juheon_Chu, if you start your example, we can consider that as an exercise, and we can look at it together if you share a photo of your handwritten work (I don’t suggest typing symbols here for maths steps).

However, the choice is yours. If I were you, I would either follow my example or, if I were not comfortable with doing maths, accept the lecture’s formula. The former might take 30 minutes, and the latter takes no time. Either way, you can move on.

Cheers,
Raymond

rmwkwok · March 22, 2024, 1:32pm

Hello @Juheon_Chu,

Besides the above three reasons, let me give you one more.

In high school calculus, we have only one chain rule, right? So it is very reasonable for us to remember this rule and apply it when we need it.

I suppose you were trying to do the same in matrix calculus, right?

Now you have read that PDF, and it contains many rules, not just one. In fact, if you open the Wikipedia page for matrix calculus, you will find maybe more than 100 identities (I didn’t count).

And even with so many identities, the page doesn’t list any matrix-by-matrix one, which is what you were looking for:

Can we really remember all those formulas? Or maybe we can just accept the lecture’s formula? Or maybe we can spend 30 minutes working out our own by following my example?

Cheers,
Raymond

Juheon_Chu · March 25, 2024, 12:30am

Thank you very much. I will accept it for now and try this again when my proficiency improves. Thank you so much for your effort in explaining this to me!
