Shouldn’t we store each jth ‘w’ in a temp_wj and b in a temp_b while doing a descent step, and only then update the original wj and b?

Also, in a real-life application, how are the initial values for so many ‘w’ chosen? What I mean is, what would the starting set of those values be?

It would take a lot of memory and it’s not really needed, unless performance goes bad, and in that case you can set a ‘callback’ to exit the gradient descent. Keras/TensorFlow normally gives you an option to save the state of the gradient descent and come back to it for further training in the future, but really the goal is to find an optimal state, not to remember every passing state!

It’s random and not very big; I would say something between 0 and 1. TensorFlow has quite a few initializers worth checking out.
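
To make the "random and not very big" idea concrete, here is a minimal sketch in plain NumPy (not one of the TensorFlow initializers). The feature count, the seed, and the names `w_init` and `b_init` are just assumptions for illustration:

```python
import numpy as np

# Seed is arbitrary; it only makes the example reproducible.
rng = np.random.default_rng(seed=0)

n_features = 5  # hypothetical number of features, one 'w' per feature

# Small random starting values for every w_j, drawn uniformly from [0, 1).
w_init = rng.uniform(0.0, 1.0, size=n_features)

# The bias can simply start at zero.
b_init = 0.0
```

For plain linear regression the squared-error cost is convex, so starting every ‘w’ at zero should work as well; random starting values matter more once you get to neural networks.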

In simple linear regression, Andrew said that the better practice is to store a temp value of w so that its updated value isn’t used while calculating the descent step for ‘b’.

So I thought the same must hold true for multiple linear regression.
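
For what it’s worth, here is a rough sketch of what that simultaneous update could look like for multiple linear regression, assuming a squared-error cost and NumPy arrays. The function name `gradient_step` and the toy data are made up for illustration; `temp_w` and `temp_b` play the role of temp_wj and temp_b:

```python
import numpy as np

def gradient_step(X, y, w, b, alpha):
    """One simultaneous gradient descent update (illustrative sketch)."""
    m = X.shape[0]
    err = X @ w + b - y          # residuals, computed with the OLD w and b
    dj_dw = (X.T @ err) / m      # gradient for every w_j at once
    dj_db = err.sum() / m        # gradient for b, still using the old w
    temp_w = w - alpha * dj_dw   # new values for all w_j
    temp_b = b - alpha * dj_db   # new value for b
    return temp_w, temp_b        # the caller overwrites w and b together

# Toy data: 3 examples, 2 features (made-up numbers).
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([5.0, 4.0, 8.0])

w, b = np.zeros(2), 0.0
w, b = gradient_step(X, y, w, b, alpha=0.1)
```

Both gradients are computed before anything is overwritten, so the update for ‘b’ never sees the new ‘w’. Only one extra copy of the parameters exists at a time, so the memory cost is small.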