Question 1.

If gradient checking with a two-sided limit is so accurate (and we know it can't go wrong the way our hand-derived backprop equations can), why don't we always use it to compute the derivatives? I ask because, visually, the limit formula looks faster to me than these backprop formulas below:

dW[l]=(1/m)*np.dot(dZ[l], A[l-1].T)

db[l]=(1/m)*np.sum(dZ[l], axis=1, keepdims=True)

Visually, I can't see how the backpropagation equations end up faster than using the limit formula. And the difference in accuracy is so small (on the order of 10^-7) that it feels like it should be insignificant?

I would appreciate it if someone could show me how the time complexity of gradient checking is worse than that of ordinary backpropagation.
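To make my question concrete, here is a toy sketch of how I understand the two approaches (the network, names, and cost function are all made up by me, not taken from the course code). Backprop seems to produce every entry of dW from one forward/backward sweep, while the two-sided limit needs two extra forward passes for every single parameter:

```python
import numpy as np

# Hypothetical tiny one-layer network (illustrative only).
rng = np.random.default_rng(0)
m = 4                                 # number of examples
W = rng.standard_normal((3, 5))       # 3 units, 5 input features
b = np.zeros((3, 1))
X = rng.standard_normal((5, m))
Y = rng.standard_normal((3, m))

def cost(W, b):
    # One forward pass: linear layer + mean squared error.
    A = W @ X + b
    return 0.5 * np.mean(np.sum((A - Y) ** 2, axis=0))

# Backprop: ONE forward pass + ONE backward pass gives ALL gradients at once.
A = W @ X + b
dZ = (A - Y) / m                      # derivative of the cost w.r.t. Z
dW = dZ @ X.T                         # all 15 weight gradients in one matrix product
db = np.sum(dZ, axis=1, keepdims=True)

# Two-sided numerical gradient: TWO full forward passes PER PARAMETER.
eps = 1e-7
dW_num = np.zeros_like(W)
passes = 0
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_num[i, j] = (cost(Wp, b) - cost(Wm, b)) / (2 * eps)
        passes += 2

print(passes)                         # 30 forward passes for just 15 weights
print(np.max(np.abs(dW - dW_num)))    # the two agree to within a tiny tolerance
```

If this sketch is right, the numerical route costs roughly (2 × number of parameters) forward passes per gradient evaluation, versus one forward-plus-backward pass for backprop, which is why I suspect the answer is about scaling rather than per-operation speed.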

Question 2.

I don't quite get how the weights and biases are reshaped into theta. And if J(theta) is indeed calculated, how does that relate to the db and dW we get via backpropagation? I understood that db and dW would be reshaped into dtheta too, but what exactly does that mean? Is theta[l] = [W[l], b[l]]?
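To show what I currently think the reshaping means, here is a sketch (the shapes and the helper name `to_vector` are my own assumptions, not the course's actual code): every W[l] and b[l] gets flattened and stacked end to end into one long column vector theta, and the same ordering applied to dW[l], db[l] would give dtheta.

```python
import numpy as np

# Illustrative parameters for a 2-layer network (shapes are my own choice).
rng = np.random.default_rng(1)
params = {
    "W1": rng.standard_normal((4, 3)), "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)), "b2": np.zeros((1, 1)),
}

def to_vector(d, keys):
    # Flatten each matrix/vector and stack them end to end into one column.
    return np.concatenate([d[k].reshape(-1, 1) for k in keys])

keys = ["W1", "b1", "W2", "b2"]
theta = to_vector(params, keys)
print(theta.shape)   # (21, 1): 12 + 4 + 4 + 1 entries

# If grads held dW1, db1, dW2, db2 from backprop, the SAME key order would
# give dtheta, so theta[i] and dtheta[i] refer to the same parameter.
```

Is this the right picture, i.e. theta is one flat vector rather than a per-layer pair [W[l], b[l]], and J(theta) just means unpacking theta back into the W's and b's before running the usual forward pass?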