dW for L2 regularization

As shown in the screenshot above, the updated cost function for L2 regularization equals the old cost function plus a regularization term. Within the regularization term, let’s focus on the summation. It is the sum of the squared Frobenius norms of all weight matrices (the weights of all layers), which is a scalar. So far, so good.

What gets confusing is the derivative. It has a “from backprop” part and a differentiated regularization term. Is the differentiated regularization term the weight matrix itself, or the sum of all elements of the weight matrix? The way I differentiated it, I got the sum of the elements of one layer’s weight matrix.

I think you are right. W is slightly confusing, since Andrew sometimes uses it for the stacked weights of one layer. But Andrew redefines W as follows in this slide.

{\parallel W^{[l]}\parallel}^{2}_{F} = \sum_{i=1}^{n^{[l-1]}}\sum_{j=1}^{n^{[l]}}{(w_{ij}^{[l]})}^2

So, the result is as you wrote,

\frac{\lambda}{m} \sum_{i=1}^{n^{[l-1]}}\sum_{j=1}^{n^{[l]}}{(w_{ij}^{[l]})}

In week 1’s assignment - regularization, I tried with:

dW1 = 1/m * np.dot(dZ1, X.T) + (lambd * np.sum(W1)) / m

This fails the tests provided in the notebook.

But with this:

dW1 = 1/m * np.dot(dZ1, X.T) + (lambd * W1) / m

it passes.
Which one is correct?

That’s a great suggestion. Now I recall my old math and the derivative of the trace…

I was totally wrong. Very sorry for confusing you further.
What we need is the derivative of the squared Frobenius norm… Let me correct my math. It should be:

{\parallel W^{[l]}\parallel_F}^2 =Tr({W^{[l]}}^T W^{[l]})
\frac{\partial}{\partial W^{[l]}}{\parallel W^{[l]}\parallel_F}^2 =\frac{\partial}{\partial W^{[l]}}Tr({W^{[l]}}^T W^{[l]}) =\frac{\partial}{\partial W^{[l]}}Tr(W^{[l]} {W^{[l]}}^T) = 2W^{[l]}
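
The same result can also be seen element by element, without the trace identity. As a sketch (the indices p and q are introduced here only for illustration): each weight appears in exactly one squared term of the double sum, so

\frac{\partial}{\partial w_{pq}^{[l]}} \sum_{i}\sum_{j}{(w_{ij}^{[l]})}^2 = 2w_{pq}^{[l]}

Collecting these partial derivatives into a matrix with the same shape as W^{[l]} gives 2W^{[l]}, which is why the gradient is a matrix rather than a sum of elements.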

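So the derivative of the regularization term \frac{\lambda}{2m}{\parallel W^{[l]}\parallel_F}^2 is \frac{\lambda}{m}W^{[l]}: the factor 2 cancels the \frac{1}{2}. It is a matrix with the same shape as the weight matrix, not a scalar, so the version with (lambd * W1) / m is the correct one.
I appreciate you pointing that out. Otherwise, I would not have recalled the derivative of the trace…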
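
If it helps, the result can be checked numerically. The snippet below is only a sketch under assumptions: the shape of W1, the values of lambd and m, and the helper names frobenius_sq, reg_term and numerical_grad are made up for illustration and are not taken from the assignment notebook. It compares a central-difference gradient of the regularization term against both candidate formulas.

```python
import numpy as np

def frobenius_sq(W):
    # Squared Frobenius norm: sum of the squares of all entries (a scalar).
    return np.sum(W ** 2)

def numerical_grad(f, W, eps=1e-6):
    # Central-difference estimate of df/dW, one entry at a time.
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus = W.copy()
        W_plus[idx] += eps
        W_minus = W.copy()
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

# Illustrative values only (assumptions, not the notebook's data).
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 4))
lambd, m = 0.7, 5

def reg_term(W):
    # Regularization term for a single layer: lambda / (2m) * ||W||_F^2
    return lambd / (2 * m) * frobenius_sq(W)

num = numerical_grad(reg_term, W1)

matrix_form = lambd * W1 / m          # candidate 1: same shape as W1
scalar_form = lambd * np.sum(W1) / m  # candidate 2: a single number

print(num.shape == W1.shape)
print(np.allclose(num, matrix_form, atol=1e-6))
print(np.allclose(num, scalar_form, atol=1e-6))
```

The numerical gradient has the same shape as W1 and agrees with the matrix form, while the np.sum version is a single number that only gets broadcast across every entry of dW1.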

Best explanation. It did not occur to me that we could convert this to trace form.
Thanks :raised_hands:

Thank you very much.