From the Week #2 assignment, we learned about step 3 of SGD:
- Loop over the layers (to update all parameters, from (W^{[1]}, b^{[1]}) to (W^{[L]}, b^{[L]}))
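For concreteness, here's a minimal sketch of what I understand that step to be (the dict layout with keys like "W1" / "dW1" is just my assumption, not necessarily the assignment's exact code):

```python
import numpy as np

def sgd_step(parameters, grads, learning_rate):
    """Step 3 of SGD: loop over ALL layers l = 1..L and update every W^[l], b^[l]."""
    L = len(parameters) // 2  # each layer contributes one "W" and one "b" entry
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * grads["db" + str(l)]
    return parameters

# Toy usage: a 2-layer net, so the loop touches (W1, b1) AND (W2, b2)
params = {"W1": np.ones((3, 2)), "b1": np.zeros((3, 1)),
          "W2": np.ones((1, 3)), "b2": np.zeros((1, 1))}
grads = {"dW1": np.ones((3, 2)), "db1": np.ones((3, 1)),
         "dW2": np.ones((1, 3)), "db2": np.ones((1, 1))}
params = sgd_step(params, grads, learning_rate=0.01)
```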
I'm trying to understand why exactly we are updating all of the weights and biases for each layer. It doesn't seem intuitive to me, because wouldn't we normally just care about W^{[1]}, b^{[1]} (i.e. just like batch GD)? My only guess is that it lets us introduce some kind of optimization (weighted average, "memory") at each unit/layer, in order to add stability across training.
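For what it's worth, by weighted average / "memory" I mean something like a momentum-style update. A rough sketch of my guess (the beta = 0.9 decay rate and helper names are hypothetical, not from the assignment):

```python
import numpy as np

def init_velocity(parameters):
    """Zero 'velocity' array for every W^[l] and b^[l]."""
    return {key: np.zeros_like(val) for key, val in parameters.items()}

def momentum_step(parameters, grads, v, learning_rate, beta=0.9):
    """Keep an exponentially weighted average of past gradients per layer,
    so each update carries some 'memory' of earlier steps."""
    L = len(parameters) // 2
    for l in range(1, L + 1):
        for p in ("W", "b"):
            key = p + str(l)
            v[key] = beta * v[key] + (1 - beta) * grads["d" + key]
            parameters[key] -= learning_rate * v[key]
    return parameters, v
```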
What am I missing?