W4_A1_Inconsistent cost function notation in formula 8 and 9


In the Week 4 programming assignment 1, formulas 8 and 9 refer to the cost function as \mathcal{J}. I suggest changing them to use \mathcal{L}, to be consistent with the rest of the text.

Thanks @WinniePooh. I have filed a suggestion for making them consistent.

Raymond

Edit: the symbols are correct, see my explanation here.

That is not an inconsistency. That is the definition of the notation that Prof Ng uses: it is only the gradients of W and b that are derivatives of J. All the other gradients are derivatives of something else and are just Chain Rule factors. In particular the gradients of the A^{[l]} values are derivatives of L, not J. Notice that there is no averaging taking place in the computation of dA^{[l-1]}, but an average would be required if it were the derivative of J, right?
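For reference, the formulas in question look roughly like this (writing them from memory, so treat the exact details loosely):

$$dW^{[l]} = \frac{\partial \mathcal{J}}{\partial W^{[l]}} = \frac{1}{m}\, dZ^{[l]} A^{[l-1]T}$$

$$db^{[l]} = \frac{\partial \mathcal{J}}{\partial b^{[l]}} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[l](i)}$$

$$dA^{[l-1]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}} = W^{[l]T} dZ^{[l]}$$

The 1/m averaging shows up only in the parameter gradients, which is exactly why those two are written as derivatives of J while dA^{[l-1]} is a derivative of L.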

You can argue that the notation should have been different, but that is the way that Prof Ng has done it and he is consistent in that.

Hello @WinniePooh and @paulinpaloalto,

First, thank you, Paul, for your clarification. @WinniePooh, I have to withdraw the suggestion I filed, because those symbols are correct.

Here is my version of the explanation. We need to clearly state all the shapes to see the reasoning behind the notation:

[images: the relevant matrices and their shapes]
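To spell out the shapes in text form as well (using the course convention that layer l has n^{[l]} units and there are m training samples):

$$
\begin{aligned}
W^{[l]},\, dW^{[l]} &: (n^{[l]},\, n^{[l-1]}) \\
b^{[l]},\, db^{[l]} &: (n^{[l]},\, 1) \\
Z^{[l]},\, A^{[l]},\, dZ^{[l]},\, dA^{[l]} &: (n^{[l]},\, m)
\end{aligned}
$$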

Let me know if you disagree with / have questions about any of the above.

Note that there are two types of matrices:

  1. matrices for training parameters (those that do not have m in their shapes)
  2. matrices for samples (those that do have m in their shapes)

Our ultimate goal is to calculate (1), so let's focus on (1) first. Each element in these matrices is the gradient with respect to one weight, and that gradient is a sum of the influences of all samples (that is why m disappears from the shape: it has been averaged over). Therefore, matrices of type 1 contain cost gradients. Note that the cost is the average of the losses over all samples, whereas the loss describes a single sample.
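Here is a minimal numpy sketch of that point; the layer sizes are made up just for illustration:

```python
import numpy as np

n_l, n_prev, m = 4, 3, 5              # hypothetical layer sizes and sample count
dZ = np.random.randn(n_l, m)          # per-sample loss gradients w.r.t. Z^[l]  -> shape (n_l, m)
A_prev = np.random.randn(n_prev, m)   # activations from layer l-1              -> shape (n_prev, m)

# Type-1 matrix: the matmul sums over the m axis, and 1/m turns the sum into an average,
# so dW is a cost gradient and m no longer appears in its shape.
dW = (1 / m) * dZ @ A_prev.T
print(dW.shape)                       # (4, 3) -- same shape as W^[l], no m
```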

Now we look at (2). Matrices of type 2 have m in their shapes, meaning that they are per-sample quantities. Take dZ^{[l]} as an example: for each of the m samples it holds n^{[l]} values, because there are that many neurons in layer l. Since matrices of type 2 are sample-based, each element in them is only a loss gradient. Note again that the loss describes a single sample.
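And a similar sketch for type 2, showing that column i of dA^{[l-1]} depends only on column i of dZ^{[l]}, i.e. on a single sample's loss (again with made-up sizes):

```python
import numpy as np

n_l, n_prev, m = 4, 3, 5
W = np.random.randn(n_l, n_prev)      # W^[l]
dZ = np.random.randn(n_l, m)          # per-sample loss gradients w.r.t. Z^[l]

# Type-2 matrix: no sum or average over m, so the m columns survive.
dA_prev = W.T @ dZ                    # shape (3, 5) -- one column per sample

# Each column is computed from the matching column of dZ alone: no mixing across samples.
col0 = W.T @ dZ[:, [0]]
print(np.allclose(dA_prev[:, [0]], col0))   # True
```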

I will summarize the above with the following two equations, which highlight the relation between m and \mathcal{L}:
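$$dW^{[l]} = \frac{\partial \mathcal{J}}{\partial W^{[l]}} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} \qquad \text{(type 1: averaged over the } m \text{ samples, hence a cost gradient)}$$

$$dZ^{[l](i)} = \frac{\partial \mathcal{L}^{(i)}}{\partial Z^{[l](i)}} \qquad \text{(type 2: one column per sample, hence a loss gradient)}$$

Here \mathcal{L}^{(i)} is the loss on sample i, and \mathcal{J} = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}^{(i)} is the cost.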

@WinniePooh, I am sorry if my previous reply misled you. @paulinpaloalto, thank you again!

Cheers,
Raymond