C2W1 - Theory behind Gradient Checking formula?

Interesting. When I saw the title of this thread, I assumed you would be asking about why a “two-sided” versus “one-sided” finite difference is used to compute the approximation of the derivative. That’s an interesting question and the answer is discussed here.

I think there is a concrete reason for including some version of the length of one or both of the gradient vectors in the denominator of the error expression: scaling. What constitutes a “large” error depends on the sizes of the quantities you are dealing with. If you’re measuring the distance from here to the moon, an error of 10 meters is not too bad. If you’re measuring your thumb, that would be a huge error. Your formulation has no scaling by the magnitudes of the actual elements of the vectors, so the same threshold would mean very different things for large and small gradients.
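To make the scaling point concrete, here is a minimal NumPy sketch. The vectors and the helper names `absolute_error` and `relative_error` are made up for illustration and are not taken from the assignment. The same absolute perturbation is applied to a tiny gradient and to a huge one:

```python
import numpy as np


def absolute_error(grad, gradapprox):
    # Unscaled difference: just the Euclidean norm of the difference vector.
    return np.linalg.norm(grad - gradapprox)


def relative_error(grad, gradapprox):
    # Same numerator, scaled by the combined lengths of the two vectors.
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator


# Hypothetical data: one fixed perturbation added to two gradients
# whose magnitudes differ by six orders of magnitude.
perturbation = np.array([1e-4, -1e-4, 1e-4])
small_grad = np.array([1e-3, 2e-3, -1e-3])
large_grad = np.array([1e3, 2e3, -1e3])

abs_small = absolute_error(small_grad, small_grad + perturbation)
abs_large = absolute_error(large_grad, large_grad + perturbation)
rel_small = relative_error(small_grad, small_grad + perturbation)
rel_large = relative_error(large_grad, large_grad + perturbation)

print(f"absolute error: small={abs_small:.3e}  large={abs_large:.3e}")
print(f"relative error: small={rel_small:.3e}  large={rel_large:.3e}")
```

The absolute errors come out essentially identical, while the relative error is a few percent for the small gradient and negligible for the large one. That is the thumb versus moon distinction.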

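I would think that any of the following would be valid, but if you picked the third choice you would have to adjust your threshold for success: when the two gradients are close, its denominator is about half that of the other two, so it reports roughly twice the error.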

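E = \displaystyle \frac {\lVert \text{grad} - \text{gradapprox} \rVert}{\lVert \text{grad} \rVert + \lVert \text{gradapprox} \rVert}

E = \displaystyle \frac {\lVert \text{grad} - \text{gradapprox} \rVert}{2 \, \lVert \text{grad} \rVert}

E = \displaystyle \frac {\lVert \text{grad} - \text{gradapprox} \rVert}{\lVert \text{grad} \rVert}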
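As a rough check on how the three expressions above relate, here is a small NumPy sketch. The function names and the random test vectors are my own assumptions for illustration, not part of the course code. It computes all three for a `gradapprox` that is a slightly perturbed copy of `grad`:

```python
import numpy as np


def error_sum_of_norms(grad, gradapprox):
    # First choice: denominator is ||grad|| + ||gradapprox||.
    return np.linalg.norm(grad - gradapprox) / (
        np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    )


def error_twice_grad_norm(grad, gradapprox):
    # Second choice: denominator is 2 * ||grad||.
    return np.linalg.norm(grad - gradapprox) / (2 * np.linalg.norm(grad))


def error_grad_norm(grad, gradapprox):
    # Third choice: denominator is ||grad|| alone.
    return np.linalg.norm(grad - gradapprox) / np.linalg.norm(grad)


# Hypothetical data: a random "analytic" gradient and a numerical
# approximation that differs from it by a tiny amount.
rng = np.random.default_rng(0)
grad = rng.normal(size=50)
gradapprox = grad + 1e-7 * rng.normal(size=50)

e1 = error_sum_of_norms(grad, gradapprox)
e2 = error_twice_grad_norm(grad, gradapprox)
e3 = error_grad_norm(grad, gradapprox)

print(f"choice 1: {e1:.6e}")
print(f"choice 2: {e2:.6e}")
print(f"choice 3: {e3:.6e}")
```

When the two vectors nearly agree, the first two values are almost indistinguishable and the third is about twice as large, so a threshold tuned for the first formula would need to be roughly doubled for the third.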