In “Numerical Approximation of Gradients”, Andrew shows that, for the two-sided difference, the approximation error is on the order of epsilon squared, which (given that epsilon < 1) is much smaller than the approximation error of the one-sided difference, which is on the order of epsilon.
So, as a natural consequence of this premise, I would have expected the threshold for considering the grad check as passed to be on the order of epsilon squared (1e-14 in the example), not on the order of epsilon (1e-7 in the example): d_theta_approx is computed with the two-sided formula, so its approximation error with respect to d_theta (computed with backprop) should be on the order of epsilon squared.
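To make the premise concrete, here is a quick toy check (my own sketch, not the notebook's code) comparing the two difference formulas on a simple function:

```python
# Toy check of the premise: for f(x) = x**3 at x = 1 the true derivative is 3.
# With epsilon = 1e-4, the one-sided difference is off by roughly epsilon,
# the two-sided one by roughly epsilon squared.
def f(x):
    return x ** 3

x, eps = 1.0, 1e-4
true_grad = 3.0 * x ** 2

one_sided = (f(x + eps) - f(x)) / eps
two_sided = (f(x + eps) - f(x - eps)) / (2 * eps)

print(abs(one_sided - true_grad))  # ~3e-4, i.e. order epsilon
print(abs(two_sided - true_grad))  # ~1e-8, i.e. order epsilon squared
```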
It’s a good point, but don’t forget that we’re not just dealing with the error from using finite differences to approximate the gradients: everything here is in floating point. So we also need to allow a little leeway for the general noise induced by the usual rounding-error and error-propagation behaviors. In other words, even a completely correct algorithm is going to have some error just from the usual numerical-analysis issues. For example, something as simple as the difference between:
cost = -1./m * (some formula with np.dot or np.sum and np.multiply)
versus
cost = -(some formula with np.dot or np.sum and np.multiply)/m
will generate noticeably different results after a few thousand iterations.
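A quick way to see this, as a sketch with my own toy numbers rather than the assignment's actual cost:

```python
import numpy as np

# Mathematically, -1./m * s and -s/m are identical; in floating point they can
# disagree in the last bit, and those tiny discrepancies compound over
# thousands of gradient-descent iterations.
rng = np.random.default_rng(0)
m = 7                          # any m that is not a power of two
sums = rng.random(10000) * m   # stand-ins for np.sum(...) of the loss terms

mismatches = np.sum((-1. / m * sums) != (-sums / m))
print(mismatches)  # typically nonzero (often a sizeable fraction of the 10000):
                   # the two formulas already differ at the last-bit level
                   # before any training has even happened
```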
Interesting observation!
I think the point of this exercise is to see the difference between using the one-sided and the two-sided method to check whether your back_propagation implementation is good enough, and the degree of certainty you can get out of each of them.
Following the above, in the two-sided case we are not really interested in checking that the difference between d_theta_approx and d_theta is near epsilon squared, only that it is much smaller than epsilon. It’s all about accuracy: making sure the error your implementation of back_prop generates is small enough not to matter in the grand scheme of things.
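For reference, the quantity being compared against 1e-7 is a normalized difference, something like this sketch (variable and function names are mine, not necessarily the notebook's):

```python
import numpy as np

def grad_check_ratio(d_theta_approx, d_theta):
    """Relative difference between the two-sided approximation and backprop's
    gradient: norm of the gap, normalized by the norms of both vectors."""
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator

# A correct backprop should push this ratio well below 1e-7; because of the
# floating-point noise discussed above, expecting it to reach 1e-14 is not realistic.
```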