I have a question regarding the Week 2 lab on multiple linear regression. In the line dj_dw[j] = dj_dw[j] + err * X[i, j], they take the previous value of the gradient with respect to w[j] and add the new term onto it. Why?

I can understand doing this when updating the final w array, but why do they also do it when calculating the gradient? Thanks.

In the lab's compute_gradient function, the line dj_dw[j] = dj_dw[j] + err * X[i, j] accumulates the gradient for multiple linear regression, one training example at a time. Let me explain why this approach is used:

Accumulating Gradient Over All Examples: The goal of the compute_gradient function is to compute the gradient of the cost function with respect to each parameter w[j] and the bias b. In the context of multiple linear regression, the cost function is often a mean squared error (MSE) function. The gradient tells us how much the cost function changes with a small change in the parameters.
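Written out (standard MSE notation; the symbol names here are my own, though they match what the course typically uses), the cost is

```latex
J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2,
\qquad
f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b
```

and differentiating with respect to one weight gives

```latex
\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}
```

The sum over i is exactly what the loop accumulates into dj_dw[j]; the 1/m factor is applied by the division at the end.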

Batch Gradient Descent: The code implements a form of batch gradient descent. In this method, the gradient is computed over the entire dataset (m examples) before updating the parameters. This is why for each feature j, the algorithm sums up err * X[i, j] for all examples i.

Understanding the err * X[i, j] Term: The term err * X[i, j] is the i-th example's contribution to the partial derivative of the cost with respect to w[j] (up to the 1/m factor applied later). Here, err = (np.dot(X[i], w) + b) - y[i] is the prediction error for the i-th example; multiplying it by X[i, j] gives the contribution of the j-th feature of that example to the gradient.
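A toy computation (all numbers made up for illustration) shows what a single example contributes:

```python
import numpy as np

# Made-up numbers: one training example with two features.
X_i = np.array([2.0, 3.0])    # X[i], the i-th example's features
y_i = 5.0                     # y[i], its target
w = np.array([1.0, 0.5])      # current weights
b = 1.0                       # current bias

err = (np.dot(X_i, w) + b) - y_i   # prediction 4.5 minus target 5.0 -> -0.5
contribution = err * X_i           # this example's addition to dj_dw
```

Each example produces its own err and its own contribution vector; the loop simply adds all of these up.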

Why Summation is Necessary: By accumulating dj_dw[j] across all examples, we compute the sum of the per-example contributions, which is needed to calculate the mean gradient (the MSE cost averages over all examples). After the loop over all examples completes, the code divides dj_dw and dj_db by m. This averaging step ensures that the gradient reflects the entire dataset, not just a single example.
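Putting the pieces together, here is a minimal sketch of what compute_gradient does (my own reconstruction, not the lab's exact code; it assumes X has shape (m, n), y has shape (m,), w has shape (n,), and b is a scalar):

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Batch gradient of the MSE cost for linear regression (sketch).

    Returns dj_dw (shape (n,)) and dj_db (scalar), averaged over m examples.
    """
    m, n = X.shape
    dj_dw = np.zeros(n)   # start at zero; each example adds its contribution
    dj_db = 0.0
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]       # prediction error, example i
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]  # accumulate across examples
        dj_db = dj_db + err
    # divide by m to turn the sums into the averages the MSE gradient requires
    return dj_dw / m, dj_db / m
```

If dj_dw[j] were assigned instead of accumulated, only the last example would survive the loop, and the gradient would ignore the rest of the dataset.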

The Final Update Step: After computing the gradient, the weights w and bias b are typically updated outside this function, in the direction opposite to the gradient (gradient descent step). This is where the actual parameter update happens, using the gradients computed by this function.
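As a sketch of that outer step (assuming, as in the lab, that compute_gradient returns the averaged gradients (dj_dw, dj_db); the learning rate alpha is a hypothetical parameter name):

```python
import numpy as np

def gradient_descent_step(X, y, w, b, alpha, compute_gradient):
    """One batch gradient-descent step (sketch): move the parameters
    opposite to the averaged gradients returned by compute_gradient."""
    dj_dw, dj_db = compute_gradient(X, y, w, b)
    w = w - alpha * dj_dw   # update all weights simultaneously
    b = b - alpha * dj_db
    return w, b
```

Calling this repeatedly in a loop drives the cost down, with each iteration using the gradient of the full dataset computed by the accumulation described above.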

So, the reason for the accumulation (the dj_dw[j] = dj_dw[j] + err * X[i, j] step) is to sum up the gradients across all examples before averaging, which aligns with the mathematical formulation of the gradient for the MSE cost function in batch gradient descent.