
The code used for backward_propagation is as above, but the formulas say something different. Shouldn’t it be:
# Output layer: sigmoid + cross-entropy gives dz3 = a3 - Y
dz3 = a3 - Y
dW3 = 1 / m * np.dot(dz3, a2.T)
db3 = 1 / m * np.sum(dz3, axis=1, keepdims=True)

# Layer 2 (ReLU activation)
da2 = np.dot(W3.T, dz3)
dz2 = np.multiply(da2, np.int64(a2 > 0))
dW2 = 1 / m * np.dot(dz2, a1.T)
db2 = 1 / m * np.sum(dz2, axis=1, keepdims=True)

# Layer 1 (ReLU activation)
da1 = np.dot(W2.T, dz2)
dz1 = np.multiply(da1, np.int64(a1 > 0))
dW1 = 1 / m * np.dot(dz1, X.T)
db1 = 1 / m * np.sum(dz1, axis=1, keepdims=True)
Maybe I’m just missing your point, but it’s a question of how you manage the factor of $\frac{1}{m}$ that comes in when you compute $\frac{\partial J}{\partial L}$ as the last step of computing $dW^{[l]}$ and $db^{[l]}$: since $J = \frac{1}{m}\sum_{i=1}^{m} L^{(i)}$, we have $\frac{\partial J}{\partial L^{(i)}} = \frac{1}{m}$. There are several correct ways to formulate that. The given code is arguably more efficient, in that it includes the factor just once, in dz3, from which it percolates through all the other computations automatically, rather than being applied separately to each $dW^{[l]}$ and $db^{[l]}$ as we did it in Course 1 (see the sketch below).

You could argue that the way they did it here in the “utility functions” file is a bit odd and not quite as “pure” as the way we did it in Course 1, but I think they were just going for efficiency (less code) in the utility functions and assumed that not many people would actually look at the code and analyze it.

Also notice that they “hard-code” the number of layers in both forward and backward propagation, just to keep the code simple. This is not the real code you’d write for the general case, of course, just support routines tailored to this one exercise.
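For comparison, here is a minimal sketch of that “factor once” formulation, assuming the same variable names as the snippet in your question (X, Y, m, W2, W3, a1, a2, a3 coming from forward propagation); the only difference is that $\frac{1}{m}$ appears exactly once, in dz3:

import numpy as np

# 1/m is applied exactly once, at the point where dJ/dL enters
dz3 = 1. / m * (a3 - Y)
dW3 = np.dot(dz3, a2.T)                   # no explicit 1/m needed here
db3 = np.sum(dz3, axis=1, keepdims=True)  # or here

da2 = np.dot(W3.T, dz3)                   # dz3 already carries the 1/m ...
dz2 = np.multiply(da2, np.int64(a2 > 0))  # ... so it flows into dz2
dW2 = np.dot(dz2, a1.T)
db2 = np.sum(dz2, axis=1, keepdims=True)

da1 = np.dot(W2.T, dz2)
dz1 = np.multiply(da1, np.int64(a1 > 0))
dW1 = np.dot(dz1, X.T)
db1 = np.sum(dz1, axis=1, keepdims=True)

Because every later quantity is a linear function of dz3, the factor carries through automatically, and the resulting dW and db values are identical to the ones your version computes.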