Stochastic Gradient Descent (why do we update all weights)

From the Week #2 assignment, we learned about step 3 for SGD:

  1. Over the layers (to update all parameters, from (W^{[1]},b^{[1]}) to (W^{[L]},b^{[L]}))

I’m trying to understand why exactly we are updating all of the weights and biases for each layer. It doesn’t seem intuitive to me, because wouldn’t we normally just care about W1, b1 (i.e. just like batch GD)? My only guess is that it lets us introduce some kind of optimization (weighted average, “memory”) at each unit/layer in order to introduce stability across training.

What am I missing?

Updating weights & biases is required for the NN to predict better.

When the weights get updated is what gives rise to the three flavors of gradient descent, as sketched in the code after this list:

  1. Batch gradient descent updates weights after processing all training examples.
  2. Mini-batch gradient descent updates weights after processing mini_batch_size training examples.
  3. Stochastic gradient descent updates weights after each training example.
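
The only thing that differs between the three is where the update sits inside the training loop, not which parameters it touches. Here is a rough sketch of one epoch under each scheme, using a toy logistic-regression model so it is self-contained (the helper names forward_backward and update are made up for illustration; this is not the assignment’s code):

```python
import numpy as np

def forward_backward(X, Y, W, b):
    """Gradients of the logistic-regression cross-entropy loss on this batch."""
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(W @ X + b)))       # sigmoid predictions, shape (1, m)
    dZ = A - Y
    dW = (dZ @ X.T) / m
    db = np.sum(dZ) / m
    return dW, db

def update(W, b, dW, db, lr):
    """The update rule itself is identical in all three flavors."""
    return W - lr * dW, b - lr * db

def batch_gd_epoch(X, Y, W, b, lr):
    dW, db = forward_backward(X, Y, W, b)        # gradient over ALL m examples
    return update(W, b, dW, db, lr)              # exactly one update per epoch

def minibatch_gd_epoch(X, Y, W, b, lr, mini_batch_size=64):
    m = X.shape[1]
    for t in range(0, m, mini_batch_size):
        Xt = X[:, t:t + mini_batch_size]
        Yt = Y[:, t:t + mini_batch_size]
        dW, db = forward_backward(Xt, Yt, W, b)
        W, b = update(W, b, dW, db, lr)          # one update per mini-batch
    return W, b

def sgd_epoch(X, Y, W, b, lr):
    m = X.shape[1]
    for i in range(m):
        xi, yi = X[:, i:i + 1], Y[:, i:i + 1]
        dW, db = forward_backward(xi, yi, W, b)
        W, b = update(W, b, dW, db, lr)          # one update per single example
    return W, b
```

With X of shape (n_features, m) and Y of shape (1, m), one epoch of batch GD makes 1 update, mini-batch GD makes ceil(m / mini_batch_size) updates, and SGD makes m updates.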