W4_A1_Inconsistent cost function notation in formula 8 and 9


In the Week 4 programming assignment 1, formulas 8 and 9 refer to the cost function as \mathcal{J}. I suggest changing them to use \mathcal{L}, to be consistent with the rest of the text.

Thanks @WinniePooh. I have filed a suggestion for making them consistent.

Raymond

Edit: the symbols are correct, see my explanation here.

That is not an inconsistency. That is the definition of the notation that Prof Ng uses: it is only the gradients of W and b that are derivatives of J. All the other gradients are derivatives of something else and are just Chain Rule factors. In particular the gradients of the A^{[l]} values are derivatives of L, not J. Notice that there is no averaging taking place in the computation of dA^{[l-1]}, but an average would be required if it were the derivative of J, right?
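For reference, the formulas in question look roughly like this (writing them from memory, so treat the exact details loosely):

$$dW^{[l]} = \frac{\partial \mathcal{J}}{\partial W^{[l]}} = \frac{1}{m}\, dZ^{[l]} A^{[l-1]T}$$

$$db^{[l]} = \frac{\partial \mathcal{J}}{\partial b^{[l]}} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[l](i)}$$

$$dA^{[l-1]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}} = W^{[l]T} dZ^{[l]}$$

The 1/m averaging shows up only in the parameter gradients, which is exactly why those two are written as derivatives of J while dA^{[l-1]} is a derivative of L.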

You can argue that the notation should have been different, but that is the way that Prof Ng has done it and he is consistent in that.

Hello @WinniePooh and @paulinpaloalto,

First, thank you, Paul, for your clarification. @WinniePooh, I have to withdraw the suggestion I filed, because those symbols are correct.

Here is my version of the explanation. We need to clearly state all the shapes to see the reasoning behind the notation:

[images: the relevant matrices and their shapes]
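To spell out the shapes in text form as well (using the course convention that layer l has n^{[l]} units and there are m training samples):

$$
\begin{aligned}
W^{[l]},\, dW^{[l]} &: (n^{[l]},\, n^{[l-1]}) \\
b^{[l]},\, db^{[l]} &: (n^{[l]},\, 1) \\
Z^{[l]},\, A^{[l]},\, dZ^{[l]},\, dA^{[l]} &: (n^{[l]},\, m)
\end{aligned}
$$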

Let me know if you disagree with / have questions about any of the above.

Note that there are two types of matrices:

  1. matrices for training parameters (those that do not have m in their shapes)
  2. matrices for samples (those that do have m in their shapes)

Our ultimate goal is to calculate (1), so let's focus on (1) first. Each element in these matrices is the gradient with respect to one weight, and that gradient is a sum of the influences of all samples (that is why m disappears from the shape: it has been averaged over). Therefore, matrices of type 1 contain cost gradients. Note that the cost is the average of the losses over all samples, whereas the loss describes a single sample.
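Here is a minimal numpy sketch of that point; the layer sizes are made up just for illustration:

```python
import numpy as np

n_l, n_prev, m = 4, 3, 5              # hypothetical layer sizes and sample count
dZ = np.random.randn(n_l, m)          # per-sample loss gradients w.r.t. Z^[l]  -> shape (n_l, m)
A_prev = np.random.randn(n_prev, m)   # activations from layer l-1              -> shape (n_prev, m)

# Type-1 matrix: the matmul sums over the m axis, and 1/m turns the sum into an average,
# so dW is a cost gradient and m no longer appears in its shape.
dW = (1 / m) * dZ @ A_prev.T
print(dW.shape)                       # (4, 3) -- same shape as W^[l], no m
```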

Now we look at (2). Matrices of type 2 have m in their shapes, meaning that they are per-sample quantities. Take dZ^{[l]} as an example: for each of the m samples it holds n^{[l]} values, because there are that many neurons in layer l. Since matrices of type 2 are sample-based, each element in them is only a loss gradient. Note again that the loss describes a single sample.
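And a similar sketch for type 2, showing that column i of dA^{[l-1]} depends only on column i of dZ^{[l]}, i.e. on a single sample's loss (again with made-up sizes):

```python
import numpy as np

n_l, n_prev, m = 4, 3, 5
W = np.random.randn(n_l, n_prev)      # W^[l]
dZ = np.random.randn(n_l, m)          # per-sample loss gradients w.r.t. Z^[l]

# Type-2 matrix: no sum or average over m, so the m columns survive.
dA_prev = W.T @ dZ                    # shape (3, 5) -- one column per sample

# Each column is computed from the matching column of dZ alone: no mixing across samples.
col0 = W.T @ dZ[:, [0]]
print(np.allclose(dA_prev[:, [0]], col0))   # True
```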

I will summarize the above with the following two equations, which highlight the relation between m and \mathcal{L}:
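$$dW^{[l]} = \frac{\partial \mathcal{J}}{\partial W^{[l]}} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} \qquad \text{(type 1: averaged over the } m \text{ samples, hence a cost gradient)}$$

$$dZ^{[l](i)} = \frac{\partial \mathcal{L}^{(i)}}{\partial Z^{[l](i)}} \qquad \text{(type 2: one column per sample, hence a loss gradient)}$$

Here \mathcal{L}^{(i)} is the loss on sample i, and \mathcal{J} = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}^{(i)} is the cost.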

@WinniePooh, I am sorry if my previous reply misled you. @paulinpaloalto, thank you again!

Cheers,
Raymond