Video with derivative of logistic regression cost function
I was taking a look at the derivative of the regularization term in the cost function in this video,
J_{reg}=\frac{\lambda}{2 m}\sum_{j=1}^n w_j^2
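(To make the setup concrete, here is a tiny numerical sketch of that term; the values of w, lambda, and m below are placeholders I made up, not anything from the video.)

```python
import numpy as np

# Minimal sketch of the regularization term J_reg = (lambda / 2m) * sum_j w_j^2.
# All values here are arbitrary placeholders for illustration.
w = np.array([0.5, -1.2, 2.0])   # model weights w_1..w_n
lam = 0.1                        # regularization strength lambda
m = 100                          # number of training examples

J_reg = (lam / (2 * m)) * np.sum(w**2)
print(J_reg)
```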
To find the gradient, we need to take the derivative, and I worked through that myself. When I take the derivative, I find it to be
\frac{n\lambda}{m}\omega_j
The important difference is the factor of n.
Let’s ignore the \frac{\lambda}{2m} in the cost function and multiply by that at the end.
Here is how I arrived at that answer (writing \omega_j for the weight w_j). Repeated indices follow the Einstein summation convention, and \delta_{ij} is the Kronecker delta, which is one when i and j are equal and zero otherwise.
\frac{\partial}{\partial \omega_j}\sum_{k=1}^n \omega_k^2
=\frac{\partial}{\partial \omega_j}\left[\delta_{ik}(\omega_i \omega_k)\right]
This has to be a \delta_{ik}, not a \delta_{ij}, because it produces the \omega^2 term through a sum over the i and k indices. For example, \sum_i \sum_{j,\, j=i} x_i y_j = \delta_{ij} x_i y_j. In the same way, \sum_j \omega_j^2 = \sum_k \sum_{i,\, i=k} \omega_i \omega_k = \delta_{ik}\omega_i\omega_k.
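As a sanity check on that rewrite, here is a small NumPy sketch (my own, with made-up values) that represents the Kronecker delta as an identity matrix and does the contraction with np.einsum:

```python
import numpy as np

# Check that delta_{ik} * w_i * w_k (summed over i and k) equals sum_j w_j^2.
n = 5
w = np.random.default_rng(1).normal(size=n)
delta = np.eye(n)                          # Kronecker delta as an identity matrix

lhs = np.einsum('ik,i,k->', delta, w, w)   # delta_{ik} w_i w_k, summed over i and k
rhs = np.sum(w**2)                         # sum_j w_j^2
print(np.isclose(lhs, rhs))                # True
```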
=\delta_{ik}\frac{\partial}{\partial\omega_j}(\omega_i\omega_k)
=\delta_{ik}\left[\frac{\partial\omega_i}{\partial\omega_j}\omega_k+\omega_i\frac{\partial\omega_k}{\partial\omega_j}\right]
Here it’s worth noting that
\frac{\partial \omega_\alpha}{\partial\omega_\beta}=\delta_{\alpha\beta}
for any \alpha and \beta in \{i, j, k\}.
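That identity can also be checked numerically: the Jacobian of the map \omega \mapsto \omega should come out as the identity matrix, i.e. the Kronecker delta. A quick finite-difference sketch (my own, with an arbitrary n):

```python
import numpy as np

# The Jacobian of the identity map w -> w is d(w_a)/d(w_b) = delta_ab.
n = 4
w = np.random.default_rng(0).normal(size=n)
eps = 1e-6

jac = np.zeros((n, n))
for b in range(n):
    w_plus = w.copy()
    w_plus[b] += eps
    jac[:, b] = (w_plus - w) / eps          # finite-difference column b

print(np.allclose(jac, np.eye(n)))          # True: the Jacobian is the Kronecker delta
```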
So returning to the gradient of the cost regularization term:
=\delta_{ik}\left[\delta_{ij}\omega_k+\omega_i\delta_{kj}\right]
=\delta_{ik}\delta_{ij}\omega_k+\delta_{ik}\delta_{kj}\omega_i
For any pair of Kronecker deltas:
\delta_{\alpha\gamma}\delta_{\gamma\beta}=N\delta_{\alpha\beta} where N is the number of values the indices can take. Here, N = n.
For any Kronecker delta,
\delta_{\alpha\beta}=\delta_{\beta\alpha}
that is, Kronecker deltas are symmetric.
So I rewrote the last step as
=\delta_{ki}\delta_{ij}\omega_k+\delta_{ik}\delta_{kj}\omega_i
=n\delta_{kj}\omega_k+n\delta_{ij}\omega_i
But recall that a Kronecker delta is one if the indices are equal and zero otherwise. So
\delta_{\alpha\beta}\omega_\beta=\omega_\alpha
There’s no factor of n here because only the single term where \beta equals \alpha survives, rather than a sum over the full range of a repeated index \gamma.
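A quick numerical check of that contraction (again my own sketch, not from the video):

```python
import numpy as np

# Contracting a Kronecker delta with w over one index just relabels the
# free index: delta_{ab} w_b = w_a, with no factor of n.
n = 6
w = np.random.default_rng(2).normal(size=n)
delta = np.eye(n)

print(np.allclose(np.einsum('ab,b->a', delta, w), w))  # True
```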
In the final two steps:
=n\omega_j+n\omega_j
=2n\omega_j
Okay, now let’s combine this with the \frac{\lambda}{2m} factor from the cost function.
=\frac{n\lambda}{m}\omega_j
I’m wondering why the regularization term is scaled by m, the number of examples in the training set, rather than by n, the number of parameters in the model?
Thanks,
Steven