MLS C1 W2 - Gradient computation: implementation vs equation

Hi,

I’m probably missing something here but, as can be seen below, the Python function to compute the gradient has two loops, whereas the equation for the gradient with respect to w_j only has one summation symbol.

[image: \frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}]

def compute_gradient(X, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b. 
    """
    m,n = X.shape           #(number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):                             
        err = (np.dot(X[i], w) + b) - y[i]   
        for j in range(n):                         
            dj_dw[j] = dj_dw[j] + err * X[i, j]    
        dj_db = dj_db + err                        
    dj_dw = dj_dw / m                                
    dj_db = dj_db / m                                
        
    return dj_db, dj_dw

The way I’m thinking about this, the equation should have a second summation symbol, like this:
[image: the same gradient equation with a second summation symbol, \sum_{j=1}^{n}, added]

Am I wrong about this? Can someone please explain this to me? Thanks!

The second loop iterates through the features of each example.

This is necessary to compute f_wb if there is more than one feature, and to compute the gradients for each feature.
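
To make that concrete, here is a minimal sketch (made-up values, not from the course notebook) showing that the feature loop that accumulates dj_dw is the same computation as a single vector statement:

import numpy as np

# One example with n = 3 features, arbitrary values
x_i = np.array([2.0, 1.5, 3.0])
err = 0.7                         # (f_wb(x_i) - y_i) for that example, picked arbitrarily
n = x_i.shape[0]

dj_dw_loop = np.zeros(n)
for j in range(n):                # explicit loop over the features, as in compute_gradient
    dj_dw_loop[j] = dj_dw_loop[j] + err * x_i[j]

dj_dw_vec = err * x_i             # the same per-feature contributions in one line

print(np.allclose(dj_dw_loop, dj_dw_vec))   # True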

I understand that. What I don’t understand is why that second loop isn’t represented in the equation.

@Andromeda18

The second loop is sitting inside f_{\mathbf{w},b}(\mathbf{x}^{(i)}):

f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_n x_n^{(i)} + b

= \sum_{j=1}^{n} w_j x_j^{(i)} + b

Now, going back to the original equation:

\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{j=1}^{n} w_j x_j^{(i)} + b - y^{(i)} \right) x_j^{(i)}

Note: when we implement the above equation in Python, the summation index starts from 0 instead of 1 because Python array indices start from 0.
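
To see this numerically, here is a small sketch (arbitrary numbers, not from the course notebook) checking that the np.dot term in compute_gradient is exactly that hidden summation:

import numpy as np

# One example with n = 3 features, arbitrary values
x_i = np.array([1.0, 2.0, 3.0])
w   = np.array([0.1, 0.2, 0.3])
b   = 0.5

# f_wb written as the explicit summation over j (Python indices 0..n-1)
f_sum = sum(w[j] * x_i[j] for j in range(x_i.shape[0])) + b

# f_wb as it appears in compute_gradient
f_dot = np.dot(x_i, w) + b

print(np.isclose(f_sum, f_dot))   # True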

Hi,

Thank you for your answer. I see what you mean and I had completely forgotten about the summation inside f_{\mathbf{w},b}(\mathbf{x}^{(i)}).

However, that second summation only applies to w_j x_j^{(i)}, so, at least when it comes to programming, if we didn’t use the dot product we would actually need a third loop to implement that summation. I don’t see how we could multiply by x_j^{(i)} inside that loop, since x_j^{(i)} needs to be multiplied by values that are calculated in the outer loop (\sum_{i=1}^{m}). To make things clearer, the implementation of the gradient equation, without using the dot product, would be something like this:

def compute_gradient(X, y, w, b): 
    
    m,n = X.shape           #(number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):
        pred = 0.                          # reset the prediction for each example
        for j in range(n):
            pred = pred + (w[j] * X[i,j])
        pred = pred + b
        err = pred - y[i]
   
        for j in range(n):                         
            dj_dw[j] = dj_dw[j] + err * X[i, j]    
        dj_db = dj_db + err                        
    dj_dw = dj_dw / m                                
    dj_db = dj_db / m                                
        
    return dj_db, dj_dw
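
(For what it’s worth, a quick sanity check like the sketch below, with made-up random data rather than anything from the course notebook, shows that this loop-only version produces the same gradients as a direct vectorized computation.)

import numpy as np

# Hypothetical sanity check: random data, compare the loop-only gradients
# above against a direct vectorized computation.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # 5 examples, 3 features
y = rng.normal(size=5)
w = rng.normal(size=3)
b = 0.3

dj_db, dj_dw = compute_gradient(X, y, w, b)   # the loop-only version above

err = X @ w + b - y                           # errors for all m examples at once
m = X.shape[0]
print(np.allclose(dj_dw, X.T @ err / m))      # True
print(np.isclose(dj_db, err.sum() / m))       # True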

This implementation brings me back to my original question: shouldn’t each of these loops be represented in the equation? Shouldn’t it be something like this?

\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \left( \sum_{j=1}^{n} w_j x_j^{(i)} \right) + b - y^{(i)} \right) x_j^{(i)}

I apologize for being so insistent, but I’m trying to understand these concepts as well as possible.

Edit:
I can see how this equation might not be an accurate representation of the implementation above either, because according to the equation the third loop should be nested inside the second loop, and in the code it isn’t. But I don’t know how else to represent that implementation in equation form. Basically, my problem is not with the implementation but with the mathematical representation.

@Andromeda18

First, let me ease your concern: mathematically, there isn’t a need for the 3rd loop. It becomes clearer when we use number subscripts instead of talking in terms of j:

\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} \left( w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_n x_n^{(i)} + b - y^{(i)} \right) x_1^{(i)}

\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{j=1}^{n} w_j x_j^{(i)} + b - y^{(i)} \right) x_1^{(i)}

As you can see here, when we are evaluating \frac{\partial J}{\partial w_1}, we only multiply by x_1^{(i)}, and not by each of x_1^{(i)}, x_2^{(i)}, \dots, x_n^{(i)}.

Likewise, when we are evaluating \frac{\partial J}{\partial w_2}:

\frac{\partial J}{\partial w_2} = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{j=1}^{n} w_j x_j^{(i)} + b - y^{(i)} \right) x_2^{(i)}

Now, coming to the Python code, the 3rd for loop is used just for convenience and not because of a mathematical need. It lets us find \frac{\partial J}{\partial w_1}, \frac{\partial J}{\partial w_2}, \dots, \frac{\partial J}{\partial w_n} with a single line of code inside the loop, as opposed to manually writing n lines of code without it. Since this is only a programming convenience, we don’t need to represent it as a 3rd summation in the mathematical representation.
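
Here is a tiny sketch (made-up values, not course code) of the n hand-written lines that the 3rd for loop replaces, for n = 3:

import numpy as np

# What the j-loop is shorthand for, with n = 3 features (arbitrary values)
X_i = np.array([2.0, 1.5, 3.0])   # features of one training example
err = 0.7                         # (f_wb - y) for that example
dj_dw = np.zeros(3)

# Without the loop we would write one line per parameter w_1, w_2, w_3 ...
dj_dw[0] = dj_dw[0] + err * X_i[0]
dj_dw[1] = dj_dw[1] + err * X_i[1]
dj_dw[2] = dj_dw[2] + err * X_i[2]

# ... which is exactly what the single line inside the j-loop produces
print(np.allclose(dj_dw, err * X_i))   # True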

Hope it’s clear now.

I definitely understood your explanation and that is how I was understanding the equation.

However, your explanation allowed me to understand what was confusing me. I was under the impression that the loop in the gradient_descent(…) function, which calls the compute_gradient(…) function, was the loop that iterated over the w parameters. After your explanation I checked again, and that loop actually iterates over the number of iterations defined for gradient descent. Add to that the fact that I was so focused on the multiplication by x_j^{(i)} that I didn’t notice that each w parameter was being updated in the same line of code.
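
For anyone reading along, here is a minimal sketch of that outer loop; the signature and names are illustrative, not the exact course function:

def gradient_descent(X, y, w_in, b_in, alpha, num_iters):
    """Minimal sketch of batch gradient descent; relies on the
    compute_gradient function defined earlier in the thread."""
    w = w_in.copy()
    b = b_in
    for _ in range(num_iters):                      # iterates over gradient-descent steps,
        dj_db, dj_dw = compute_gradient(X, y, w, b) # not over the individual w_j
        w = w - alpha * dj_dw                       # updates every w_j in one vectorized line
        b = b - alpha * dj_db
    return w, b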

So, essentially, I paid too much attention to some details and not enough attention to others, and ended up wasting other people’s time. Sorry about that, and thank you for your help and patience.

You are most welcome, @Andromeda18 :blush: Glad to be of help, and don’t sweat it; we all go through these confusing thoughts.