Derivative of regularization term

When we took the derivative of the loss function for linear regression, the summation stayed, whereas the derivative of the regularization component results in the summation disappearing. Is this because we are subtracting from a particular w_j, therefore the other terms don’t matter? If we were to think of this in vectorized terms, and deal with w instead of w_j, would the formula for J(w,b), and therefore for the weight updates in gradient descent, be simpler? In other words, does that second summation not exist in the vectorized version, and that term is (lambda/2m) * (w^2) resulting in (lambda * w)/m?

Hello @pritamdodeja

Let's look at what happens mathematically to the regularization term \frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2 when we take the derivative of the cost function w.r.t. each weight variable w_j. Let's take the case of w_1.

\frac {\partial {J(\vec{w},b)}}{\partial w_1} = \frac {\partial}{\partial w_1}(.....+ \frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2)

= \frac {\partial}{\partial w_1}(.....+ \frac{\lambda}{2m}(w_1^2 + w_2^2 + ... + w_n^2)) → Expanding the summation

= .....+ \frac{\lambda}{2m}(2* w_1) → because \frac {\partial}{\partial w_1}(w_2^2 + ... + w_n^2) = 0, since those terms are all constants w.r.t. w_1

Thus, \frac {\partial {J(\vec{w},b)}}{\partial w_1}= .....+ \frac{\lambda}{m}w_1 → And this is how the summation for the regularization component vanished

And in the general case, \frac {\partial {J(\vec{w},b)}}{\partial w_j} = .....+ \frac{\lambda}{m}w_j
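A quick numerical check of this result, as a minimal sketch. The values of `lam`, `m` and `w` are arbitrary assumptions, and `reg_term` is a hypothetical helper name. It compares the analytic gradient (lambda/m) * w against a finite-difference estimate, and it also shows that in vectorized form the summation is just the dot product of w with itself.

```python
import numpy as np

lam, m = 0.7, 50                      # assumed values, for illustration only
w = np.array([0.5, -1.2, 2.0])        # assumed weight vector

def reg_term(w):
    """Regularization term (lambda / 2m) * sum_j w_j^2, written as a dot product."""
    return lam / (2 * m) * np.dot(w, w)

analytic = lam / m * w                # claimed gradient: one entry per w_j

# Central finite differences: perturb one w_j at a time, hold the others fixed
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (reg_term(w + step) - reg_term(w - step)) / (2 * eps)

print(np.allclose(analytic, numeric))
```

Each entry of `numeric` only picks up the change in its own w_j squared, which is the same reason the summation drops out in the derivation.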

Hope it's clear now.


I watched Eddy’s video and was able to derive the following for the partial derivative of J with respect to w with regularization

I will have to work through the vectorized version in an implementation to better understand conceptually what is going on. I'm still trying to come to terms with the derivative with respect to w instead of w_j. Thank you for your help!

Good going @pritamdodeja. See how it goes with the vectorization case as well.

Thanks @shanup and @rmwkwok! I just finished course 1, but am thinking of spending some more time on these concepts as I think there is a lot of depth here that needs to be explored/understood. Thank you for your mentorship!

Good going @pritamdodeja!! At face value it might seem that Course 1 comprises simple concepts that can be quickly glossed over, but you are quite right: there is a wealth of information to be derived from it, and the ungraded labs, if looked at meticulously, will help us gain a lot of understanding. This will help us even when we move on to more advanced material.

I implemented the vectorized version and it is super short, maybe it can fit in a tweet :). I am going to study some linear algebra for a little bit to try and get a better intuition for the shapes involved and the meaning of these dot products. I’m going to see if I can simplify this further, not a fan of the reshaping I am doing here.

Hello @pritamdodeja, nice try. Sometimes I like to put down the shape of each variable at the end of each line as comments to remind myself what operations are needed. I can tell you that reshape isn’t necessary :), remember for 2 matrices A_{m\times n} and B_{h \times k}, a valid multiplication requires n = h, so besides the shapes, the ordering of A and B also matters.
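To illustrate the shape-comment habit, here is a minimal sketch of vectorized gradient descent that needs no reshape. It is not the code from the post above. It assumes X is stored as (m, n) and w as (n,), and the names `w_true`, `b_true`, `lam` and the hyperparameter values are made up for the example. The regularization gradient appears as the `lam / m * w` term.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 3
X = rng.random((m, n))                        # (m, n)
w_true = rng.random(n)                        # (n,)
b_true = 0.5                                  # scalar
y = X @ w_true + b_true                       # (m, n) @ (n,) -> (m,)

w = np.zeros(n)                               # (n,)
b = 0.0                                       # scalar
alpha, lam = 0.1, 0.0                         # lam = 0 switches regularization off

for _ in range(20000):
    error = X @ w + b - y                     # (m,)
    dj_dw = X.T @ error / m + lam / m * w     # (n, m) @ (m,) -> (n,)
    dj_db = error.mean()                      # scalar
    w, b = w - alpha * dj_dw, b - alpha * dj_db

cost = error @ error / (2 * m)                # (m,) @ (m,) -> scalar
print(f"Cost term at the end is {cost}")
```

Because the inner dimensions line up in `X @ w` and `X.T @ error`, the order of the operands does the work that a reshape would otherwise do.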


Hi @rmwkwok, I am doing this with (m, n, 1), hence I am having to reshape. I am going to do a deep dive on the dot product and other linear algebra concepts to see if I can simplify this. Another thing I'm thinking of doing is creating a toy problem that predicts points in 2D space, motivated by trying to understand how the formulas generalize. I'm trying to see if there is some advantage to thinking in terms of (m, n, 1) as we get into higher dimensions. I will put up reproducible code here shortly so you can recreate my setup if you like. Thank you!

Ah I see! Pritam, would you mind sharing the shapes of your w_initial, w and your y_train?


Here is a (hopefully) reproducible example:

import numpy as np
# {{{ m, n, 1 case
X_shape = (1000, 3, 1)
m = X_shape[0]
w_shape = X_shape[1:]
X = np.random.random(size=X_shape)
w_generative = np.random.random(size=w_shape)
y_shape = np.dot(w_generative.T, X).shape[-1]
b_generative = np.random.random(size=())
# w_generative.T has shape (1, 3), a row vector, and is applied to every X_i here.
# w is a column vector, and so is X_i. To determine if it's a column or row vector,
# look at the axis that has size 1.
y_train = np.dot(w_generative.T, X) + b_generative  # shape (1, 1000, 1)
b_initial = 0
w_initial = np.zeros(shape=w_shape)
w = w_initial
b = b_initial
alpha = 0.01
number_of_iterations = 50000
for iteration_number in range(number_of_iterations):
    predicted = np.dot(w.T, X) + b  # shape (1, 1000, 1)
    error = predicted - y_train
    error = error.reshape(-1, y_train.shape[-1])  # shape (1000, 1)
    cost = np.dot(error.T, error)/(2*m)  # shape (1, 1)
    dj_dw = np.dot(X.T, error).reshape(w_initial.shape)/m  # shape (3, 1)
    dj_db = np.dot(error.T, np.ones(y_train[0].shape))/m  # shape (1, 1)
    w, b = w - alpha*dj_dw, b - alpha*dj_db
print(f"Cost term at the end is {cost}")

Your shape for X is (sample size, feature size, depth size).

Given that the first index refers to the number of samples, shouldn't your y take the shape (1000, 1) when y has depth, (1000, ) when y doesn't have depth, or simply (1000, 1, 1)?

Currently, your y has the shape of (1, 1000, 1).

I agree that (1000, 1) seems to be the logical shape for y_train, but I didn't set the shape of y_train; it is the result of np.dot(w_generative.T, X) + b_generative. It is the same case with predicted, which is also the result of the dot product plus the bias/intercept term. I have to understand conceptually the transformation that the dot product represents to make some progress here.
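For reference, a small sketch of where the (1, 1000, 1) shape comes from, assuming the same X and w shapes as in the example above (the array values themselves are random placeholders). With an N-dimensional second argument, `np.dot` and the `@` operator (`np.matmul`) lay out the result differently.

```python
import numpy as np

X = np.random.random(size=(1000, 3, 1))    # (m, n, 1)
w = np.random.random(size=(3, 1))          # (n, 1)

# np.dot contracts the last axis of w.T with the second-to-last axis of X;
# the leftover axis of w.T comes first in the result.
dot_result = np.dot(w.T, X)
print(dot_result.shape)                    # (1, 1000, 1)

# The @ operator treats X as a stack of m matrices of shape (n, 1) and
# broadcasts w.T across the stack, so the sample axis stays first.
matmul_result = w.T @ X
print(matmul_result.shape)                 # (1000, 1, 1)

# Same numbers, different layout
print(np.allclose(dot_result[0], matmul_result[:, 0]))   # True
```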

Of course! Please take your time. I think numpy can do it no matter which shape of y you want; we only need to apply the right dot transformation, as you said. Let us know if you have a question. Good luck with your exploration!! :wink:

I got the case where y is two-dimensional to work :slight_smile:. In this case, X is a set of vectors in the 3D input space, which get mapped to a 2D output space by the matrix w, which is really two column vectors u and v stacked on top of each other. The three components of u and v are the linear transformations that map the basis vectors in the input space to the output space. So in a sense, the problem has two parts: how to map the input space to the output space, which is the part that linear algebra solves, and where exactly the basis vectors map, which is the part that calculus solves. This is really blowing my mind right now! I have been watching "Essence of linear algebra preview" on YouTube to understand this better. Now that it's working, I'm going to try to understand why it's working and what it has in common with the lower-dimensional situation :slight_smile:
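One way to see the basis-vector picture concretely, as a rough sketch: the matrix `W` below is a random (2, 3) placeholder standing in for the stacked u and v (u and v would be its rows), and `x` is an arbitrary input chosen for illustration. Column i of `W`, that is the pair (u_i, v_i), is where the i-th input basis vector lands.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((2, 3))            # maps 3-d inputs to 2-d outputs (assumed values)

# Feeding the input basis vectors e_1, e_2, e_3 through W returns its columns
basis = np.eye(3)
images = W @ basis                # (2, 3) @ (3, 3) -> (2, 3)
print(np.allclose(images, W))     # True

# Any input is a combination of basis vectors, so its image is the same
# combination of the columns of W
x = np.array([2.0, -1.0, 0.5])
combo = 2.0 * W[:, 0] - 1.0 * W[:, 1] + 0.5 * W[:, 2]
print(np.allclose(W @ x, combo))  # True
```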

Hi Pritam, sounds like you are enjoying. That’s great.
P.S. I like videos of 3Blue1Brown too.

I spent some time investigating the (m, n, 1) vs (m, n, ) situation; see the screenshot below. The left is the (m, n, 1) case and the right is the (m, n, ) case. The tradeoff appears to be that you get cleaner formulas for the transformations with (m, n, 1), but at the cost of having to squeeze all the points. With (m, n, ) you don't need to squeeze any of the points, but the transformations look unnatural. I am now going to investigate bringing this down to lower dimensions.

Another thought that went through my head was how does the two-dimensional case map to the logistic regression scenario. For example, we could map the output space to the unit square and make predictions from four choices, or a cube, or any arbitrary shape. The span of the basis vectors in the output space seems to be a very interesting thing apparently.

I’m pasting the code for the (m, n, 1) scenario below in case it is useful.

# {{{ m, n, 1 case and y is two dimensional cleaned up
import numpy as np
number_of_samples = 1000
input_dimensions = 3
output_dimensions = 2
X_shape = (number_of_samples, input_dimensions, 1)
m = number_of_samples
X = np.random.random(size=X_shape)
generative_matrix = np.random.random(size=(output_dimensions, input_dimensions, 1))
b_generative = np.random.random(size=(output_dimensions,1))
y_train = np.squeeze(generative_matrix)@X + b_generative  # shape (1000, 2, 1)
y_shape = y_train.shape[1:]
b = np.zeros(shape=b_generative.shape)
u = np.zeros(shape=(input_dimensions, 1))
v = np.zeros(shape=(input_dimensions, 1))
w = np.squeeze(np.stack((u,v)))  # shape (2, 3)
alpha = 0.01
number_of_iterations = 5000
for i in range(number_of_iterations):
    predicted = w@X + b  # shape (1000, 2, 1)
    error = np.squeeze(predicted - y_train)  # shape (1000, 2)
    cost = np.trace(np.dot(error.T, error))/(2*m)  # sum of squared errors / 2m
    dj_dw = np.dot(error.T, np.squeeze(X))/m  # shape (2, 3)
    dj_db = np.dot(error.T, np.ones((m,1)))/m  # shape (2, 1)
    w, b = w - alpha*dj_dw, b - alpha*dj_db
    print(f"Cost is now {cost}")
print(f"Cost term at the end is {cost}")
# }}}

Hey Pritam,

My version. The Einstein Summation np.einsum is particularly helpful in high dimensions. :wink:

import numpy as np
number_of_samples = 1000
input_dimensions = 3
output_dimensions = 2
depth_dimensions = 1

def dot_along_first_axis(A, B):
    '''A.shape = (i,j,k) B.shape =(p,j,k)'''
    return np.einsum('ijk, pjk->ipk', A, B)

# Generate X and y
X = np.random.random(size=(number_of_samples, input_dimensions, depth_dimensions)) #shape (1000, 3, 1)
w_generative = np.random.random(size=(output_dimensions, input_dimensions, depth_dimensions)) #shape (2, 3, 1)
b_generative = np.random.random(size=(output_dimensions, depth_dimensions)) #shape (2, 1)
y_true = dot_along_first_axis(X, w_generative) + b_generative #shape (1000, 2, 1)

# initialize parameters
b = np.zeros(shape=b_generative.shape) #shape (2, 1) 
w = np.zeros(shape=w_generative.shape) #shape (2, 3, 1)

# sample size and hyper parameters
m = X.shape[0]
alpha = 0.01
number_of_iterations = 5000

# gradient descent
for i in range(number_of_iterations):
    y_pred = dot_along_first_axis(X, w) + b #shape (1000, 2, 1)
    error = y_pred - y_true #shape (1000, 2, 1)
    cost = np.sum(error**2)/(2*m) # scalar, same as np.einsum('ijk, ijk', error, error)/(2*m)
    dj_dw = np.einsum('ijk, iqk->jqk', error, X)/m #shape (2, 3, 1)
    dj_db = np.sum(error, axis=0)/m #shape (2, 1), same as np.einsum('ijk->jk', error)/m
    w = w - alpha*dj_dw #shape (2, 3, 1)
    b = b - alpha*dj_db #shape (2, 1)
    # print(f"Cost is now {cost}")

print(f"Cost term at the end is {cost}")
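For readers new to np.einsum, a small sketch of what the 'ijk, pjk->ipk' subscripts compute. The array sizes are shrunk to 5 samples and the values are random placeholders. With a depth of 1, the einsum should agree with an ordinary matrix product on the arrays with the depth axis dropped.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((5, 3, 1))    # (samples, inputs, depth)
w = rng.random((2, 3, 1))    # (outputs, inputs, depth)

# Sum over the shared input axis j; keep sample i, output p and depth k
y_einsum = np.einsum('ijk, pjk->ipk', X, w)          # (5, 2, 1)

# Depth is 1 here, so drop it, multiply as plain matrices, then restore it
y_matmul = (X[:, :, 0] @ w[:, :, 0].T)[:, :, None]   # (5, 3) @ (3, 2) -> (5, 2, 1)

print(y_einsum.shape, np.allclose(y_einsum, y_matmul))
```

The letters that appear on the left of the arrow but not on the right (here j) are the axes that get summed over.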

Wow, this looks beautiful. Am going to have to study it to understand the magic that is going on here. I studied mostly linear algebra and linear regression today, but then went back to logistic regression. I am just loving getting to study such fundamental concepts!

If I may be so bold as to offer a slight addition to your explanation and the calculation of the partial derivative: I think the confusion some people experience with the prof's slide is due to an unfortunate choice of notation. The j in w_j under the partial derivative operator and the j in w_j under the summation sign in the regularisation term conceptually mean two different things.

The j in the partial derivative operator is there to point to a specific element of the parameter vector w, while the j in the regularisation term is used as a general index variable in a summation. This entire sum in the regularisation term is computed for every value of j in the partial derivative operator.

In this specific example, a clearer choice of notation, in my opinion, would have been to use a different index variable to sum over in the regularisation term, such as k. This would clearly highlight that the regularisation term is made of a sum of terms (w_1)^2, (w_2)^2, etc.

Then, given some value of j, such as j = 2, we would seek the partial derivative of the sum in the regularisation term with respect to w_2. Applying the standard rules for differentiation of sums, constants, and powers of a variable, we would find that all but one of the terms of the sum are constants with respect to w_2, so they vanish under differentiation, leaving only the derivative of the (w_2)^2 term in place, hence removing the summation sign from the resulting equation.
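Written out with k as the summation index (my own relabelling of the term from the slide, not the course's notation), the step looks like this:

```latex
\frac{\partial}{\partial w_j}\left(\frac{\lambda}{2m}\sum_{k=1}^{n} w_k^2\right)
= \frac{\lambda}{2m}\sum_{k=1}^{n}\frac{\partial\, w_k^2}{\partial w_j}
= \frac{\lambda}{2m}\cdot 2 w_j
= \frac{\lambda}{m} w_j
```

Only the k = j term of the sum has a non-zero derivative, which is why a single w_j remains.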

I thought this was worth mentioning, for the sake of completeness :slightly_smiling_face: