Derivative of regularized logistic cost function-- does it need the DIMENSION of the w vector?

Video with derivative of logistic regression cost function

I was taking a look at the derivative for the regularization term in the cost function in this video,

J_{reg}=\frac{\lambda}{2 m}\sum_{j=1}^n w_j^2

To find the gradient, we need to take the derivative, so I worked that out myself. When I take the derivative, I find it to be

\frac{n\lambda}{m}\omega_j

The important difference is the factor of n.

Let’s ignore the \frac{\lambda}{2m} in the cost function and multiply by that at the end.

Here is how I arrived at that answer. Repeated indices use the Einstein summation convention, and the \delta_{ij} symbol is the Kronecker delta function, which is one when i and j are equal and zero otherwise.

\frac{\partial}{\partial \omega_j}\sum_{k=1}^n \omega_k^2
=\frac{\partial}{\partial \omega_j}\left[\delta_{ik}(\omega_i \omega_k)\right]

This has to be a \delta_{ik}, not a \delta_{ij}, because it produces the \omega^2 term through a sum over the i and k indices. For example, \sum_i \sum_{j,\,j=i} x_i y_j = \delta_{ij} x_i y_j. This is the same as \sum_j \omega_j^2 = \sum_k \sum_{i,\,i=k} \omega_i \omega_k = \delta_{ik}\omega_i\omega_k.
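As an aside, here is a small numerical sketch of that identity (the values of w are arbitrary), using numpy.einsum with the Kronecker delta written as an identity matrix:

    import numpy as np

    w = np.array([1.0, 2.0, 3.0])
    delta = np.eye(len(w))                         # Kronecker delta as an identity matrix

    sum_sq = np.sum(w ** 2)                        # sum_j w_j^2
    einstein = np.einsum("ik,i,k->", delta, w, w)  # delta_{ik} w_i w_k, summed over i and k

    print(sum_sq, einstein)                        # both print 14.0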

=\delta_{ik}\frac{\partial}{\partial\omega_j}(\omega_i\omega_k)
=\delta_{ik}\left[\frac{\partial\omega_i}{\partial\omega_j}\omega_k+\omega_i\frac{\partial\omega_k}{\partial\omega_j}\right]

Here it’s worth noting that

\frac{\partial \omega_\alpha}{\partial\omega_\beta}=\delta_{\alpha\beta}
for any \alpha and \beta in i,j,k

So returning to the gradient of the cost regularization term:

=\delta_{ik}\left[\delta_{ij}\omega_k+\omega_i\delta_{kj}\right]

=\delta_{ik}\delta_{ij}\omega_k+\delta_{ik}\delta_{kj}\omega_i

For any pair of Kronecker deltas:
\delta_{\alpha\gamma}\delta_{\gamma\beta}=N\delta_{\alpha\beta}, where N is the number of dimensions that the indices can take. Here, N = n.

For any Kronecker delta,
\delta_{\alpha\beta}=\delta_{\beta\alpha}
since Kronecker delta functions are symmetric.

So I rewrote the last step as

=\delta_{ki}\delta_{ij}\omega_k+\delta_{ik}\delta_{kj}\omega_i
=n\delta_{kj}\omega_k+n\delta_{ij}\omega_i

But recall that a Kronecker delta is one if the indices are equal and zero otherwise. So

\delta_{\alpha\beta}\omega_\beta=\omega_\alpha

There’s no factor of n here because there’s only one index \beta that matches \alpha, rather than needing to sum over an entire set of repeated indices \gamma.
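A similar sketch of this index-matching rule (again with arbitrary values):

    import numpy as np

    w = np.array([1.0, 2.0, 3.0])
    delta = np.eye(len(w))                   # Kronecker delta

    # delta_{ab} w_b contracts only the single matching index b, so the
    # result is just w_a again; no factor of n appears.
    print(np.einsum("ab,b->a", delta, w))    # [1. 2. 3.]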

In the final two steps:

=n\omega_j+n\omega_j
=2n\omega_j

Okay, now let’s combine with the \frac{\lambda}{2m} from the cost function.

=\frac{n\lambda}{m}\omega_j

I’m wondering: why the choice to scale the regularization by m, the number of examples in the training set, rather than by n, the number of parameters in the model?

Thanks,
Steven

Hi @s-dorsher,

I believe there is a mistake in your derivation. For the regularization term \displaystyle J_{\rm reg} = {\lambda \over 2m} \sum_{j = 1}^n w_j^2 we want to compute the derivative with respect to a specific weight w_j:

{\partial J_{\rm reg} \over \partial w_j} = {\lambda \over 2m} \left( {\partial\over \partial w_j} \sum_{j = 1}^n w_j^2 \right) = {\lambda \over 2m} \left( {\partial\over \partial w_j} \sum_{k = 1}^n w_k^2 \right) = {\lambda \over 2m} \sum_{k = 1}^n {\partial\over \partial w_j} w_k^2 \\ = {\lambda \over 2m} \left( {\partial\over \partial w_j} w_1^2 + \dots + {\partial\over \partial w_j} w_j^2 + \dots + {\partial\over \partial w_j} w_n^2 \right) \\ = {\lambda \over 2m} \left( 0 + \dots + 2 w_j + \dots + 0 \right) = {\lambda \over m} w_j.

Please notice that I changed the summation index from j to k in the sum to avoid confusion: the partial derivative is taken with respect to w_j, and using the same index in both the sum and the derivative would lead to ambiguity.

Dividing by m helps to ensure that the gradient step size due to the regularization is of similar scale to the gradient of the loss function.


This is an interesting question that has come up before. I don’t know a definitive answer, but one idea would be that the point is to make the choice of the hyperparameter \lambda orthogonal to the size of your training set. I have not taken MLS, so I don’t know what Prof Ng says there about regularization, but when he discusses handling overfitting in DLS Course 2, he mentions that one of the first choices for next steps when you have an overfitting problem is to get more training data that covers a wider statistical spectrum. In the limit, as m \rightarrow \infty, your L2 term will go to zero with the factor of \frac{1}{m}.

Of course getting more training data is not always an easy or inexpensive thing to do, so then you consider regularization. But maybe you can do a hybrid strategy and get some more data and also apply L2 regularization. In that case, having the \lambda value be orthogonal to your training set size might be useful.

Prof. Ng in this video (at 4:30) said: “A couple of things I would like to point out by convention, instead of using lambda times the sum of w_j squared. We also divide lambda by 2m so that both the 1st and 2nd terms here are scaled by 1 \over 2m. It turns out that by scaling both terms the same way it becomes a little bit easier to choose a good value for lambda. And in particular you find that even if your training set size growth, say you find more training examples. So m the training set size is now bigger. The same value of lambda that you’ve picked previously is now also more likely to continue to work if you have this extra scaling by 2m.”

In modern frameworks such as TensorFlow and PyTorch, L2 regularization is implemented as weight decay, and it doesn’t take the batch size into account (see the short sketch after this list):

  1. TensorFlow
    The L2 regularization penalty is computed as: loss = l2 * reduce_sum(square(x)).
    dense = Dense(3, kernel_regularizer=tf.keras.regularizers.L2(l2=0.01))

  2. PyTorch
    It is built into the optimizer: torch.optim.SGD(params, lr=0.001, weight_decay=0.01).
    weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
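Here is a minimal sketch putting both together (the layer sizes and the 0.01 penalty are placeholders, not recommended values):

    import tensorflow as tf
    import torch

    # Keras: the penalty l2 * sum(w**2) is added to the loss for this layer's kernel.
    dense = tf.keras.layers.Dense(
        3, kernel_regularizer=tf.keras.regularizers.L2(l2=0.01)
    )

    # PyTorch: weight_decay applies an L2 penalty inside the optimizer update,
    # independent of the batch size m.
    model = torch.nn.Linear(4, 3)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, weight_decay=0.01)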

CORRECTING IT FROM HERE DOWN

So since this doesn’t format well when quoted, and I’m new and don’t know how to fix it, the last equation was:

=\delta_{ik}\delta_{ij}\omega_k+\delta_{ik}\delta_{kj}\omega_i

The correct logic at this point is to say that the \delta_{ik} in the first term changes the \omega_k to an \omega_i, and the \delta_{ik} in the second term changes the \omega_i to an \omega_k, because a delta function indicates that the indices are equal.

So the equation now reads:
=\delta_{ij}\omega_i+\delta_{kj}\omega_k

The reason this is different from the logic outlined in the original post is that the \omega's carry specific individual indices, rather than being summed over an index, so we are index-matching rather than summing over the whole range.

So again, we index-match. In the first term, \omega_i becomes \omega_j because of the \delta_{ij}, and in the second term, \omega_k becomes \omega_j because of the \delta_{kj}.

So at this point, we have

=\omega_j+\omega_j

Which is

=2\omega_j

When including \frac{\lambda}{2m}, we get

=\frac{2\omega_j\lambda}{2m}=\frac{\lambda}{m}\omega_j
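As a quick numerical sanity check (a sketch with arbitrary values for n, m, and \lambda), a central finite difference on J_{reg} matches \frac{\lambda}{m}\omega_j, with no extra factor of n:

    import numpy as np

    def j_reg(w, lam, m):
        # J_reg = (lambda / (2*m)) * sum_j w_j^2
        return lam / (2 * m) * np.sum(w ** 2)

    rng = np.random.default_rng(0)
    n, m, lam = 5, 100, 0.7
    w = rng.normal(size=n)

    analytic = lam / m * w          # the result derived above
    eps = 1e-6
    numeric = np.array([
        (j_reg(w + eps * np.eye(n)[j], lam, m)
         - j_reg(w - eps * np.eye(n)[j], lam, m)) / (2 * eps)
        for j in range(n)
    ])

    print(np.max(np.abs(analytic - numeric)))   # agrees to floating-point precision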

Thank you so much for the help correcting this, @conscell. Your work was absolutely correct and very helpful!

Steven


@s-dorsher,

I am not as familiar with the Einstein summation convention as you are. Why don’t we represent the regularization term as J_{\rm reg} = {\lambda \over 2m} \sum_{\alpha = 1}^n w_\alpha^2 = {\lambda \over 2m} w_\alpha w_\alpha? We set \partial_j = {\partial \over \partial w_j}. Then for \partial_j J_{\rm reg} we have {\lambda \over 2m} \partial_j (w_\alpha w_\alpha) = {\lambda \over 2m} \cdot 2 w_\alpha \delta_{\alpha j} = {\lambda \over m} w_\alpha \delta_{\alpha j} = {\lambda \over m} w_j.

Sure, that works. It wouldn’t have worked under my original mistake, but it definitely works with your fix, and with my original error corrected. Thank you for the shortened and improved version. :slight_smile:


I fixed this part (an hour or two ago) as well. I’m not sure why I had made this mistake, but it is what led to the wrong answer and the lengthier form.

@s-dorsher,

If you are familiar with matrix calculus, it can be even shorter: {\partial \over \partial {\bf w}} J_{\rm reg} = {\lambda \over 2m} {\partial ({\bf w}^\top {\bf w}) \over \partial {\bf w}} = {\lambda \over 2m} \cdot 2{\bf w} = {\lambda \over m} {\bf w}.
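As a quick check of the vector form (a sketch with arbitrary sizes and an arbitrary \lambda), autograd gives the same {\lambda \over m}{\bf w}:

    import torch

    m, lam = 100, 0.7
    w = torch.randn(5, requires_grad=True)

    j_reg = lam / (2 * m) * (w @ w)    # (lambda / (2m)) * w^T w
    j_reg.backward()

    print(torch.allclose(w.grad, lam / m * w.detach()))   # True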

@conscell

I am. However, as you can see, in the matrix form it’s not represented elementwise as a derivative with respect to w_j giving \frac{\lambda}{m}w_j. That was what I was trying to check. My first attempt to prove it to myself had given an inconsistent answer, and we have now established the mistake I made.

I know embarrassing me is a fun game, but I do understand at this point, thank you.

@s-dorsher,
It was never my intention to embarrass you, and I’m glad everything is clear now. Actually, I am happy to meet someone familiar with Einstein notation - it’s a great opportunity for me to learn something new. It came in handy once, when I manually implemented the forward and backward pass for a 2D convolution. It is a very short and clean method (with its limitations, though), which lets you avoid cryptic shape massaging. In the vector form we could also simply select an element of the resulting vector: {\partial J_{\rm reg} \over \partial w_j} = \left({\partial J_{\rm reg} \over \partial {\bf w}} \right)_j.

@conscell I overthought that one (not really a great excuse) having spent too much time on tensors, and being eight years rusty. I looked at that and thought, isn’t that a dual vector? Should I really just be able to take the index j and set that equal to the index j of the vector on the other side? I have to admit, my sloppy derivation with delta functions and no metric did not actually clarify, but it did at least prove to me that there was nothing all that fishy about the vector calculus, once I did it right.

Thanks so much for clarifying the obvious ways, though, rotfl! I definitely tripped myself up on this one!

Edit:

I knew the vector calculus answer and couldn’t figure out why I wasn’t getting it, but I legitimately did not think of your summation approach. That was really helpful. Thank you.
