Gradient Descent for Logistic Regression

Hello and thank you for an excellent course!
My question is regarding the Gradient Descent Implementation function in the Optional lab: Gradient descent for logistic regression.
this is the algorithm:

In the function “def compute_gradient_logistic(X, y, w, b):”, the line
dj_dw = np.zeros((n,)) #(n,)
runs only once, before the outer loop.
Question: shouldn’t dj_dw be zeroed before each inner loop, that is, once for each example X[i] in the training data?
I think the correct implementation should include the commented-out line:

{code removed by mentor}

Please explain why I am wrong.

The final dj_dw = dj_dw/m averages a dj_dw that has been accumulated over all m samples. How could it accumulate anything if you reset it to zero before all the samples have been iterated through?


PS: I removed your code because we can’t share assignment code here.

Hi @Aviv_Simionovici ,

dj_dw = np.zeros((n,))
creates the variable dj_dw as a vector of size n (the number of features).
The code then updates this variable for each sample and each feature of that sample, so there is no need to create the same variable again and again.

Thanks @rmwkwok and @Kic. I realize now that the code needs both loops: the outer loop to sum (the sigma) over i, the training samples, and the inner loop to cover all j, the features. For each training sample i, the inner loop runs over all features and accumulates into dj_dw[j]. That is the nested-loop mechanism. It doesn’t make sense to zero dj_dw before the inner loop (or to recreate the variable). My mistake; I am rusty at coding.
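To make the effect of the reset concrete, here is a minimal sketch, not the lab's code. The function name gradient_nested_loops, the reset_each_sample flag, and the sample data are hypothetical and exist only for illustration. With the flag off, dj_dw is zeroed once and accumulates over all samples; with the flag on, it is zeroed before every inner loop, so only the last sample's contribution survives.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_nested_loops(X, y, w, b, reset_each_sample=False):
    """Hypothetical sketch of a nested-loop logistic gradient."""
    m, n = X.shape
    dj_dw = np.zeros((n,))   # zeroed once, before the outer loop
    dj_db = 0.0
    for i in range(m):       # outer loop: the sum over samples i
        if reset_each_sample:
            # The change asked about: this discards everything
            # accumulated from samples 0 .. i-1.
            dj_dw = np.zeros((n,))
        err_i = sigmoid(np.dot(X[i], w) + b) - y[i]
        for j in range(n):   # inner loop: one slot per feature j
            dj_dw[j] += err_i * X[i, j]
        dj_db += err_i
    return dj_dw / m, dj_db / m

# Made-up data, only for illustration.
X = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5]])
y = np.array([0.0, 0.0, 1.0])
w = np.array([0.2, -0.1])
b = 0.3

dw_ok, _ = gradient_nested_loops(X, y, w, b)
dw_reset, _ = gradient_nested_loops(X, y, w, b, reset_each_sample=True)
print("accumulated over all samples:", dw_ok)
print("reset before each inner loop:", dw_reset)
```

The two printed vectors differ because the second one reflects only the final sample, divided by m.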

You are very welcome, @Aviv_Simionovici! :wink:


I have a related question
We have:

In Lab 6 of Week 3 (C1_W3_Lab06_Gradient_Descent_Soln), the compute_gradient_logistic function that calculates this partial derivative has an outer loop over i (each training example) and an inner loop over j (each feature, from 1 to n).

I don’t understand: in the formula, j is fixed and the sum runs over i, but in the lab, i is fixed and the loop runs over j?
Thank you!

Hello @Svetlana_Verthein,

Let’s think about this:

  1. If we follow exactly the formula you shared with us, what’s going to happen is that the compute_gradient_logistic will accept a parameter called j, and then instead of having an inner loop to iterate through different values of j, we should remove it and just fix the j value as provided, right?

  2. In that case, the compute_gradient_logistic function will only be useful to calculate one j at a time.

  3. But we don’t want to call compute_gradient_logistic as many times as the number of features.

  4. So, instead of passing the value of j in, we use a loop to go over all possible j, so that we only need to call compute_gradient_logistic once, and the gradients for all the weights are found.
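Assuming the formula in question is the usual logistic-regression partial derivative used in the course, it can be written as follows, where the sigmoid model f is the standard one and is stated here as an assumption:

```latex
\frac{\partial J(\mathbf{w}, b)}{\partial w_j}
  = \frac{1}{m} \sum_{i=1}^{m}
    \left( f_{\mathbf{w},b}\!\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right) x_j^{(i)},
  \qquad j = 1, \dots, n,
\qquad
f_{\mathbf{w},b}(\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x} + b)}}
```

Here j is a free index, so the formula really describes n separate sums, one per feature. A function that returns the whole gradient has to produce all n of them, which is why it visits every j as well as every i.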

Is this clear to you?


Hello, Raymond, and thank you so much for your answer.
But that’s not what I had in mind:

  1. compute_gradient_logistic doesn’t need to accept j as a parameter
  2. everything in this function stays the same, only the outer loop becomes a j-loop (finding w_j for each one of the feature columns, i.e. w_1 for the column of all features x_1, etc. for all n features) and the inner loop becomes the i-loop (then we’d exactly follow the formula, which specifies a sum over all training examples 1 to m for each j).

In other words, in my mind I’d just reverse the j-loop (make it the outer loop) and i-loop (make it the inner loop calculating the sum over all the training examples for that feature) - and it would conform to the formula.
Otherwise it seems we are finding a unique w for each training example, rather than for each feature.
Does it make sense?
Thank you!

Hello @Svetlana_Verthein,

Switching the order of looping samples and loop features is fine, and I think your approach also makes sense. :slight_smile:
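As a rough check that the loop order does not matter, here is a small sketch with both orders side by side. The function names grad_samples_outer and grad_features_outer and the data are hypothetical, not taken from the lab. In both versions the result has one entry per feature (shape (n,)), never one per training example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_samples_outer(X, y, w, b):
    """i outer, j inner: each sample adds its share to every dj_dw[j]."""
    m, n = X.shape
    dj_dw = np.zeros(n)
    for i in range(m):
        err_i = sigmoid(np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] += err_i * X[i, j]
    return dj_dw / m

def grad_features_outer(X, y, w, b):
    """j outer, i inner: mirrors the formula (fix j, sum over i)."""
    m, n = X.shape
    # Prediction errors do not depend on j, so compute them once up front.
    err = np.array([sigmoid(np.dot(X[i], w) + b) - y[i] for i in range(m)])
    dj_dw = np.zeros(n)
    for j in range(n):
        for i in range(m):
            dj_dw[j] += err[i] * X[i, j]
    return dj_dw / m

# Made-up data: m = 4 examples, n = 2 features.
X = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5], [2.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = np.array([0.1, -0.2])
b = 0.0

g1 = grad_samples_outer(X, y, w, b)
g2 = grad_features_outer(X, y, w, b)
print(g1.shape, g2.shape)      # both have one entry per feature
print(np.allclose(g1, g2))
```

Both orders visit the same (i, j) pairs and add the same terms into the same dj_dw[j] slots, so they should agree up to floating-point rounding.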


Hi, I have a related question to this thread:

in the def compute_gradient_logistic(X, y, w, b) function, in the For loop over ‘i’,

f_wb_i is equal to the sigmoid of a dot product of X[i],w + b.

Does the dot product not involve multiplication of all x (x0, x1…xn) and w (w0, w1…wn) values for each [i] (example), i.e. all n w’s? If so, why do we need an inner j loop to go through all the n features if the features have all been included with each run of the first (‘i’) loop?


Hi Jem,

Your understanding is correct: the dot product does involve all the x and w values for each example.

There are two ways to do it: (i) via an inner loop, or (ii) with another vector operation. The lab chose the first way, but we can actually implement it without the inner loop. As an exercise, can you figure out how to do it without the inner loop? :slight_smile:
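For reference (and a spoiler for the exercise), here is one possible sketch of option (ii). The function names gradient_one_loop and gradient_no_loops and the data are hypothetical. Note that the dot product with w only produces the scalar prediction; the gradient step then multiplies the error by the features X[i], not by w again.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_one_loop(X, y, w, b):
    """Keeps the loop over samples; a vector update replaces the j loop."""
    m, n = X.shape
    dj_dw = np.zeros(n)
    dj_db = 0.0
    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i], w) + b)  # scalar prediction, uses all of w
        err_i = f_wb_i - y[i]
        dj_dw += err_i * X[i]                  # updates all n entries at once
        dj_db += err_i
    return dj_dw / m, dj_db / m

def gradient_no_loops(X, y, w, b):
    """Fully vectorized: no explicit Python loops."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y               # shape (m,), one error per sample
    return X.T @ err / m, err.sum() / m        # shapes (n,) and scalar

# Made-up data, only for illustration.
X = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5]])
y = np.array([0.0, 0.0, 1.0])
w = np.array([0.2, -0.1])
b = 0.3

dw_a, db_a = gradient_one_loop(X, y, w, b)
dw_b, db_b = gradient_no_loops(X, y, w, b)
print(np.allclose(dw_a, dw_b), np.isclose(db_a, db_b))
```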


Hi Raymond,

Thanks for your reply. Sorry, I was asking why, given that the dot product step already involves multiplying all the x and w values for each example [i], we need another step to go through all n features. I’m obviously missing something, but it appears that these two steps multiply by each feature’s w twice.