I understand the intuition behind back propagation and how the six variables were derived. However, I don’t understand why, for the matrix implementation, we set the bias gradient to be the sum of the rows of dZ:
db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
From my understanding, dZ is a matrix whose rows represent the m samples of data. So wouldn’t it make more sense to set db = dZ, since db would then be a matrix where each bias vector corresponds to a sample datapoint?
Here are the dimensions of db^{[1]} and dZ^{[1]}:
db^{[1]} and b^{[1]} have dimensions n^{[1]} x 1, where n^{[1]} is the number of output neurons in layer 1.
dZ^{[1]} and Z^{[1]} are n^{[1]} x m, where m is the number of samples.
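To make the shape argument concrete, here is a minimal numpy sketch (the sizes n1 = 4 and m = 3 are just hypothetical values for illustration):

```python
import numpy as np

n1, m = 4, 3                      # hypothetical layer width and sample count
dZ = np.random.randn(n1, m)       # dZ^{[1]} has one column per sample: shape (n1, m)

# Summing over axis=1 collapses the sample axis; the 1/m factor averages the
# per-sample gradients into a single (n1, 1) vector -- the same shape as b^{[1]}.
db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)

print(dZ.shape)  # (4, 3)
print(db.shape)  # (4, 1)
```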
Note that all the “parameters”, meaning the W^{[l]} and b^{[l]} values, are independent of the number of samples. The forward propagation formulas work for any number of samples without changing the shapes of W and b.
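As a rough illustration of that point (the names W, b, A_prev and the sizes below are hypothetical), the same W and b work for any number of samples because numpy broadcasting adds b to every column of the product:

```python
import numpy as np

n0, n1 = 5, 4                        # hypothetical input size and layer-1 size
W = np.random.randn(n1, n0)          # shape (n1, n0) -- does not depend on m
b = np.zeros((n1, 1))                # shape (n1, 1) -- does not depend on m

for m in (1, 10, 1000):              # try several sample counts
    A_prev = np.random.randn(n0, m)  # one column per sample
    Z = W @ A_prev + b               # broadcasting adds b to every column
    print(m, Z.shape)                # Z is (n1, m); W and b never change shape
```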
Note that the derivation of all the back propagation formulas is beyond the scope of this course. Here’s a thread with links to other materials that cover the topic.