Delta Loss question

I see that the delta loss is expressed as a (1, m) matrix in the Coursera assignment:

dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

Why isn’t it expressed as a single number? Like this:

dAL = np.sum(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) / -m

Why am I asking this? I saw that when we compute dW and db, the 1/m appears there:

dW = dZ @ A_prev.T / m
db = np.sum(dZ, axis=1, keepdims=True) / m

At least, do I understand correctly that if we divided dA in the last layer by m up front, we wouldn’t need to divide by m later?

I.e., is there some reason why we need to compute dW and db for each dataset sample separately and only average them in the last step?
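
To make the question concrete, here is a minimal NumPy sketch (the sigmoid output layer, the shapes, and the random data are my own assumptions for illustration, not taken from the assignment). It checks that folding the 1/m into dAL up front produces the same dW and db as applying 1/m at the dW/db step:

import numpy as np

rng = np.random.default_rng(0)
m = 5                                              # number of samples
AL = rng.uniform(0.1, 0.9, size=(1, m))            # last-layer sigmoid outputs
Y = rng.integers(0, 2, size=(1, m)).astype(float)  # binary labels
A_prev = rng.standard_normal((3, m))               # previous-layer activations

# Course version: dAL carries no 1/m; it is applied inside dW and db.
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
dZ = dAL * AL * (1 - AL)                           # sigmoid'(Z) = AL * (1 - AL)
dW = dZ @ A_prev.T / m
db = np.sum(dZ, axis=1, keepdims=True) / m

# Folded version: divide dAL by m up front and drop the later 1/m.
dZ_f = (dAL / m) * AL * (1 - AL)
dW_f = dZ_f @ A_prev.T
db_f = np.sum(dZ_f, axis=1, keepdims=True)

print(np.allclose(dW, dW_f), np.allclose(db, db_f))  # True True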

Prof Ng’s notation is a bit ambiguous: you have to watch the context to figure out whether a d… value is a final gradient or merely a “Chain Rule” factor used to compute a final gradient. Only dW^{[l]} and db^{[l]} are actually the full gradients, meaning that they are the partial derivatives of J w.r.t. the parameter in question. So those are the only ones that are averaged over the samples; the rest are per-sample quantities (pre-average). In your particular example:

dAL = \displaystyle \frac{\partial L}{\partial AL}

So it is a vector of dimension 1 x m. Remember that J is the average of the loss values L across the samples, and the derivative of the average is the average of the derivatives. Think about it for a second and that should make sense.
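
Written out, that is just linearity of differentiation (using the course’s superscript (i) to index samples):

J = \displaystyle \frac{1}{m} \sum_{i=1}^{m} L\big(a^{(i)}, y^{(i)}\big) \quad\Rightarrow\quad \frac{\partial J}{\partial W^{[l]}} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L\big(a^{(i)}, y^{(i)}\big)}{\partial W^{[l]}}

So yes, to your last question: the 1/m can be applied once anywhere along the chain, either in dAL at the start of backprop or in dW^{[l]} and db^{[l]} at the end, and the resulting gradients are identical.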


Here’s a thread from Eddy that has derivations of some of the relevant quantities.
