I am getting an assertion error on the cost. The first value is correct, but the one after many iterations is not. I looked into my values of grad(L): it converges to 0 for the weights but, for some reason, not for the bias. The cost initially decreases as the weights approach the absolute minimum, but then the erratic behavior of the bias causes it to increase. The fact that dL/db is changing the way it is would suggest there are multiple critical points, which of course is not true. I believe my update rules are fine (e.g. w -= learning_rate * dw). I do not see where the problem could lie.

Yes, this is the one place where -= actually should work, because they did the “deepcopy” of w and b in the template code.
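To illustrate why the deepcopy matters (a hypothetical sketch, not the template code itself): an in-place `-=` on a NumPy array mutates the array the caller passed in, so without the copy the graders' initial `w` and `b` would be silently modified.

```python
import copy
import numpy as np

w_init = np.zeros((3, 1))

# Without a copy, the alias shares the same underlying array,
# so the in-place update also changes w_init.
w_alias = w_init
w_alias -= 0.1            # w_init is now -0.1 everywhere

# With a deep copy (as in the template code), the original is untouched.
w_copy = copy.deepcopy(w_init)
w_copy -= 0.1             # only w_copy changes
```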

I’m not sure why you think the problem is specific to b. I suggest looking for bugs in your gradient formulas, or perhaps you are “hard-coding” the learning rate. You don’t say clearly which test case this is, but I’m guessing it’s the optimize test cell. The second test case there uses a learning rate different from the default 0.009, so if you aren’t using the value that’s passed in, it will cause problems.
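A minimal sketch of the bug pattern I mean (the function name and signature here are illustrative, not the assignment's exact code):

```python
import numpy as np

def update(w, b, dw, db, learning_rate=0.009):
    # Correct: use whatever learning rate was passed in.
    w = w - learning_rate * dw
    b = b - learning_rate * db
    # Buggy version: w = w - 0.009 * dw
    # This happens to pass any test that uses the default rate,
    # but fails as soon as a test passes a different one.
    return w, b

# A test that passes learning_rate=0.5 would expose the hard-coded value.
w, b = update(np.array([1.0]), 0.5, np.array([2.0]), 1.0, learning_rate=0.5)
```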

My gradient formulas were previously accepted by the grader, but I will look into them again. I believe the problem is with dL/db, as the absolute optimum is where the gradient vector equals 0. Since the cost function is convex, gradient descent should find this point after infinitely many iterations. I don’t believe the problem is with the learning rate: even if it is hardcoded, dL/db should exhibit one of the following behaviors:

1. converge straight to 0,
2. alternate between positive and negative values and eventually converge to 0, or
3. alternate between negative and positive values and diverge from 0.

(These assumptions rest on convexity.) I may be making wrong assumptions. Please correct me if I am wrong.

I’m glad to hear that you found your mistake, but it is also worth discussing your ideas about convergence. Even with a convex problem (as we have with Logistic Regression), there is never any guarantee that Gradient Descent will converge to exactly the correct solution with a fixed learning rate, as we are using here. That is a limitation of a fixed learning rate: you can “overshoot” and oscillate, or even diverge. There are more sophisticated algorithms that adaptively reduce the learning rate as they get closer to the local or global optimum, but that is not what we are doing here.
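You can see this on the simplest convex function imaginable, f(x) = x², whose gradient is 2x (a toy demo, not the assignment's cost function): a small fixed learning rate contracts toward the minimum, while a large fixed one flips sign and grows every step.

```python
def descend(x, lr, steps=50):
    """Plain gradient descent on f(x) = x**2 with a fixed learning rate."""
    for _ in range(steps):
        x -= lr * 2 * x   # gradient of x**2 is 2*x; no learning-rate decay
    return x

# lr = 0.1: each step multiplies x by (1 - 0.2) = 0.8 -> converges to 0.
small = descend(1.0, lr=0.1)

# lr = 1.1: each step multiplies x by (1 - 2.2) = -1.2 -> oscillates and diverges,
# even though the problem is perfectly convex.
large = descend(1.0, lr=1.1)
```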