MLS C1 W2 - Gradient computation: implementation vs equation

Hi,

I’m probably missing something here but, as can be seen below, the Python function to compute the gradient has two loops, whereas the equation for the gradient with respect to w_j only has one summation symbol.

[image: \frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}]

def compute_gradient(X, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b. 
    """
    m,n = X.shape           #(number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):                             
        err = (np.dot(X[i], w) + b) - y[i]   
        for j in range(n):                         
            dj_dw[j] = dj_dw[j] + err * X[i, j]    
        dj_db = dj_db + err                        
    dj_dw = dj_dw / m                                
    dj_db = dj_db / m                                
        
    return dj_db, dj_dw

The way I’m thinking about this, the equation should have a second summation symbol, like this:
[image: the same gradient equation with a second summation symbol, \sum_{j=1}^{n}, added]

Am I wrong about this? Can someone please explain this to me? Thanks!

The second loop iterates through the features of each example.

This is necessary to compute f_wb if there is more than one feature, and to compute the gradients for each feature.
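
To make that concrete, here is a minimal sketch (made-up values, not from the course notebook) showing that the feature loop that accumulates dj_dw is the same computation as a single vector statement:

import numpy as np

# One example with n = 3 features, arbitrary values
x_i = np.array([2.0, 1.5, 3.0])
err = 0.7                         # (f_wb(x_i) - y_i) for that example, picked arbitrarily
n = x_i.shape[0]

dj_dw_loop = np.zeros(n)
for j in range(n):                # explicit loop over the features, as in compute_gradient
    dj_dw_loop[j] = dj_dw_loop[j] + err * x_i[j]

dj_dw_vec = err * x_i             # the same per-feature contributions in one line

print(np.allclose(dj_dw_loop, dj_dw_vec))   # True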

I understand that. What I don’t understand is why that second loop isn’t represented in the equation.

@Andromeda18

The second loop is sitting inside f_{\mathbf{w},b}(\mathbf{x}^{(i)}):

f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_n x_n^{(i)} + b

= \sum_{j=1}^{n} w_j x_j^{(i)} + b

Now, going back to the original equation:

\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{j=1}^{n} w_j x_j^{(i)} + b - y^{(i)} \right) x_j^{(i)}

Note: when we implement the above equation in Python, the summation index starts from 0 instead of 1 because Python array indices start from 0.
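
To see this numerically, here is a small sketch (arbitrary numbers, not from the course notebook) checking that the np.dot term in compute_gradient is exactly that hidden summation:

import numpy as np

# One example with n = 3 features, arbitrary values
x_i = np.array([1.0, 2.0, 3.0])
w   = np.array([0.1, 0.2, 0.3])
b   = 0.5

# f_wb written as the explicit summation over j (Python indices 0..n-1)
f_sum = sum(w[j] * x_i[j] for j in range(x_i.shape[0])) + b

# f_wb as it appears in compute_gradient
f_dot = np.dot(x_i, w) + b

print(np.isclose(f_sum, f_dot))   # True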

Hi,

Thank you for your answer. I see what you mean and I had completely forgotten about the summation inside f_{\mathbf{w},b}(\mathbf{x}^{(i)}).

However, that second summation only applies to w_j x_j^{(i)}, so, at least when it comes to programming, if we didn’t use the dot product we would actually need a third loop to implement that summation. I don’t see how we could multiply by x_j^{(i)} inside that loop, since x_j^{(i)} needs to be multiplied by values that are calculated in the outer loop (\sum_{i=1}^{m}). To make things clearer, the implementation of the gradient equation, without using the dot product, would be something like this:

def compute_gradient(X, y, w, b): 
    
    m,n = X.shape           #(number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):
        pred = 0.                          # reset the prediction for each example
        for j in range(n):
            pred = pred + (w[j] * X[i,j])
        pred = pred + b
        err = pred - y[i]
   
        for j in range(n):                         
            dj_dw[j] = dj_dw[j] + err * X[i, j]    
        dj_db = dj_db + err                        
    dj_dw = dj_dw / m                                
    dj_db = dj_db / m                                
        
    return dj_db, dj_dw
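
(For what it’s worth, a quick sanity check like the sketch below, with made-up random data rather than anything from the course notebook, shows that this loop-only version produces the same gradients as a direct vectorized computation.)

import numpy as np

# Hypothetical sanity check: random data, compare the loop-only gradients
# above against a direct vectorized computation.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # 5 examples, 3 features
y = rng.normal(size=5)
w = rng.normal(size=3)
b = 0.3

dj_db, dj_dw = compute_gradient(X, y, w, b)   # the loop-only version above

err = X @ w + b - y                           # errors for all m examples at once
m = X.shape[0]
print(np.allclose(dj_dw, X.T @ err / m))      # True
print(np.isclose(dj_db, err.sum() / m))       # True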

This implementation brings me back to my original question: shouldn’t each of these loops be represented in the equation? Shouldn’t it be something like this?

\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \left( \sum_{j=1}^{n} w_j x_j^{(i)} \right) + b - y^{(i)} \right) x_j^{(i)}

I apologize for being so insistent, but I’m trying to understand these concepts as well as possible.

Edit:
I can see how this equation might not be an accurate representation of the implementation above either, because according to the equation the third loop should be nested inside the second loop, and in the code it isn’t. But I don’t know how else to represent that implementation in equation form. Basically, my problem is not with the implementation but with the mathematical representation.

@Andromeda18

First, let me ease your concern: mathematically, there isn’t a need for the 3rd loop. It becomes clearer when we use number subscripts instead of talking in terms of j:

\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} \left( w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_n x_n^{(i)} + b - y^{(i)} \right) x_1^{(i)}

\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{j=1}^{n} w_j x_j^{(i)} + b - y^{(i)} \right) x_1^{(i)}

As you can see here, when we are evaluating \frac{\partial J}{\partial w_1}, we only multiply by x_1^{(i)}, and not by each of x_1^{(i)}, x_2^{(i)}, \dots, x_n^{(i)}.

Likewise, when we are evaluating \frac{\partial J}{\partial w_2}:

\frac{\partial J}{\partial w_2} = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{j=1}^{n} w_j x_j^{(i)} + b - y^{(i)} \right) x_2^{(i)}

Now, coming to the Python code, the 3rd for loop is used just for convenience and not because of a mathematical need. It lets us find \frac{\partial J}{\partial w_1}, \frac{\partial J}{\partial w_2}, \dots, \frac{\partial J}{\partial w_n} with a single line of code inside the loop, as opposed to manually writing n lines of code without it. Since this is only a programming convenience, we don’t need to represent it as a 3rd summation in the mathematical representation.
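
Here is a tiny sketch (made-up values, not course code) of the n hand-written lines that the 3rd for loop replaces, for n = 3:

import numpy as np

# What the j-loop is shorthand for, with n = 3 features (arbitrary values)
X_i = np.array([2.0, 1.5, 3.0])   # features of one training example
err = 0.7                         # (f_wb - y) for that example
dj_dw = np.zeros(3)

# Without the loop we would write one line per parameter w_1, w_2, w_3 ...
dj_dw[0] = dj_dw[0] + err * X_i[0]
dj_dw[1] = dj_dw[1] + err * X_i[1]
dj_dw[2] = dj_dw[2] + err * X_i[2]

# ... which is exactly what the single line inside the j-loop produces
print(np.allclose(dj_dw, err * X_i))   # True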

Hope it’s clear now.

I definitely understood your explanation and that is how I was understanding the equation.

However, your explanation allowed me to understand what was confusing me. I was under the impression that the loop in the gradient_descent(…) function, which calls the compute_gradient(…) function, was the loop that iterated over the w parameters. After your explanation I checked again, and that loop actually iterates over the number of iterations defined for gradient descent. Add to that the fact that I was so focused on the multiplication by x_j^{(i)} that I didn’t notice that each w parameter was being updated in the same line of code.
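
For anyone reading along, here is a minimal sketch of that outer loop; the signature and names are illustrative, not the exact course function:

def gradient_descent(X, y, w_in, b_in, alpha, num_iters):
    """Minimal sketch of batch gradient descent; relies on the
    compute_gradient function defined earlier in the thread."""
    w = w_in.copy()
    b = b_in
    for _ in range(num_iters):                      # iterates over gradient-descent steps,
        dj_db, dj_dw = compute_gradient(X, y, w, b) # not over the individual w_j
        w = w - alpha * dj_dw                       # updates every w_j in one vectorized line
        b = b - alpha * dj_db
    return w, b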

So, essentially, I paid too much attention to some details and not enough attention to others, and ended up wasting other people’s time. Sorry about that, and thank you for your help and patience.

You are most welcome, @Andromeda18 :blush: Glad to be of help, and don’t sweat it; we all go through these confusing thoughts.