Hi learners and teachers,

While watching the video about Gradient Descent in Multiple Linear Regression, a question came up: if we update the parameters **w** and **b** simultaneously, which of the values **w1** to **wn** do we use when updating **b** in the Gradient Descent algorithm for **b**?

It was easier to understand in Univariate Linear Regression, where there was only one **w** and one **b**, so the simultaneous update made sense there. But I can’t get this in Multiple Linear Regression. There are so many **w’s** that could be used for updating **b**, and we would get a different **b** every time if we used all of them.

Any help is more than welcome!

Hi @javeriaa

In multivariate linear regression, gradient descent updates all the parameters (every w_j and b) simultaneously in each iteration, using the same set of predictions. The gradients for each w_j and for b are computed from the current parameter values, and every prediction uses the whole vector w, not a single w_j. Updating simultaneously means no gradient is ever computed from a mix of old and new parameter values.
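In case formulas help, here are the update rules written out. This is a sketch that assumes the usual squared-error cost, with a learning rate \alpha, m training examples, and a superscript (i) for the i-th example; none of that notation appears earlier in the thread, so treat it as an assumption.

$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b = w_1 x_1 + \dots + w_n x_n + b$$

$$w_j := w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)} \qquad (j = 1,\dots,n)$$

$$b := b - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\bigr)$$

The only place w enters the rule for b is inside f, and f is a single number per example because the dot product sums over all of w_1 to w_n. So there is exactly one new b per iteration, not one per w_j.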

Many thanks for replying

My question is about the part of updating **b** only.

I understand it should be done simultaneously, but as **b** is not a vector and **w** is a vector, what I don’t understand is what value of **w** we are going to put in **f(x)**, i.e. **wx+b**, in the Gradient Descent Algorithm for **b**?

Please tell me if I am missing something here.

Let me put it this way: in many formulations, the bias term b is folded into the weight vector as w_0, with x_0 set to 1 for every sample. This unifies the parameters into a single vector, W, whose components are all updated simultaneously.
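Here is a minimal sketch of that idea in plain Python, assuming a squared-error cost. The function name `folded_step`, the toy numbers, and the learning rate `alpha` are all made up for illustration. It prepends x_0 = 1 to every sample, takes one gradient step on the combined vector, and then recomputes the b update the "separate" way to show the two agree.

```python
def folded_step(X, y, theta, alpha):
    """One gradient step on theta = [b, w_1, ..., w_n] (hypothetical helper)."""
    m = len(X)
    X_aug = [[1.0] + list(x_i) for x_i in X]  # x_0 = 1 for every sample
    # Each error uses the full parameter vector through the dot product.
    errors = [sum(t * x for t, x in zip(theta, row)) - y_i
              for row, y_i in zip(X_aug, y)]
    # Gradient component j is the mean of error * x_j; for j = 0, x_0 = 1.
    grad = [sum(e * row[j] for e, row in zip(errors, X_aug)) / m
            for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grad)]


# Toy values, chosen only for illustration.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [5.0, 6.0]
b, w = 0.5, [1.0, -1.0]
alpha = 0.1

theta_new = folded_step(X, y, [b] + w, alpha)

# The same b update written separately: mean error, with no x_j factor.
errors = [sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b - y_i
          for x_i, y_i in zip(X, y)]
b_new = b - alpha * sum(errors) / len(X)
# theta_new[0] and b_new agree up to floating-point rounding.
```

Seen this way, b is just the weight attached to a feature that is always 1, which is why its gradient has no x_j factor.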

If you still have questions or need further clarification, feel free to ask, and I can provide more detailed explanations with formulas!

Why does it matter whether b is a scalar or a vector? The point is that the gradients are computed based on the current values of all the parameters (w and b). Then they are applied to both sets of parameters giving the new values that will be used in the next iteration.

In other words the entire process of the training iterations is:

1. Compute forward propagation with the current parameter values.
2. Compute gradients with the current parameter values.
3. Apply the gradients to update the parameters.
4. Go back to step 1 for the next iteration using the updated parameters.
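The loop above can be sketched in plain Python for multiple linear regression. This is only an illustration under assumptions: the function names, the toy data, the learning rate `alpha = 0.1`, and the iteration count are all invented here, and the cost is assumed to be squared error.

```python
def predict(x, w, b):
    """f(x) = w . x + b, using every component of w."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def gradient_step(X, y, w, b, alpha):
    """One simultaneous update of all w_j and of b."""
    m, n = len(X), len(w)
    # Forward pass and errors: computed from the CURRENT w and b only.
    errors = [predict(x_i, w, b) - y_i for x_i, y_i in zip(X, y)]
    # Gradients: one number per w_j, and one number for b.
    dw = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m for j in range(n)]
    db = sum(errors) / m
    # Apply every update together, so no gradient sees a half-updated model.
    new_w = [w_j - alpha * dw_j for w_j, dw_j in zip(w, dw)]
    new_b = b - alpha * db
    return new_w, new_b


# Toy data generated from y = 2*x1 + 3*x2 + 1 (illustrative values).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [0.0, 1.0]]
y = [2 * x1 + 3 * x2 + 1 for x1, x2 in X]

w, b = [0.0, 0.0], 0.0
for _ in range(5000):
    w, b = gradient_step(X, y, w, b, alpha=0.1)
# w approaches [2, 3] and b approaches 1 on this noise-free data.
```

Note that `db` never picks "one of the w's": the whole vector is consumed inside `predict`, and what comes out is a single error per example.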

@javeriaa in my mind, and Paul can like smack me here if I am mathematically wrong, but what I think at least the way it is taught, or generally explained with regards to neural nets…

Perhaps you don’t get to see the ‘traditional’ matrix form of linear regression, which is basically:

$$y = X\beta + \epsilon$$

Does this look familiar? Because it *really should*.

But in ‘standard regression’ (here in matrix form), you have no layers and no activations, so the optimization amounts to computing $\beta$ in a single step as $\beta = (X^TX)^{-1}X^Ty$ (assuming $X^TX$ is invertible). Your optimization occurs right there, as least squares.
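To make the one-step formula concrete, here is a small plain-Python sketch. It makes a few assumptions of its own: the intercept is handled by a leading column of ones in X, the helper names `solve` and `normal_equation` are invented, and instead of forming the inverse explicitly it solves the equivalent system $(X^TX)\beta = X^Ty$ by Gaussian elimination. In practice you would normally reach for a linear algebra library rather than hand-rolling this.

```python
def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def normal_equation(X, y):
    """Least-squares beta from (X^T X) beta = X^T y, intercept first."""
    Xa = [[1.0] + list(row) for row in X]  # leading column of ones
    k = len(Xa[0])
    XtX = [[sum(row[i] * row[j] for row in Xa) for j in range(k)]
           for i in range(k)]
    Xty = [sum(row[i] * y_i for row, y_i in zip(Xa, y)) for i in range(k)]
    return solve(XtX, Xty)


# Toy data generated from y = 2*x1 + 3*x2 + 1 (illustrative values).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [0.0, 1.0]]
y = [2 * x1 + 3 * x2 + 1 for x1, x2 in X]
beta = normal_equation(X, y)  # [intercept, coefficient of x1, coefficient of x2]
```

On this noise-free data the result recovers the generating coefficients up to rounding, with no iterations and no learning rate.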

In this limited case, without activations or a further loss function, everything magically happens right there.

But once you start talking about neural networks, with every additional layer and all the additional cross connections (even aside from the fact that the activations now make this ‘non-linear’), we cannot just ‘walk back’ in one step to do our updates.

We kind of have to meander there. Yet, at least in a ‘traditional’ neural net, your $\epsilon$ is really no different than your $b$ bias term.

It is kind of linear regression all the way down, but how you gather terms and optimize is quite different.