The reason given appears to be that dividing by those magnitudes normalizes the output to make it of order epsilon.
However, let’s analyze the best case scenario in which d\vec\theta_{approx} carries the theoretical error of O(\epsilon^2). In that case, since each component of (d\vec\theta_{approx} - d\vec\theta) will be O(\epsilon^2), |d\vec\theta_{approx} - d\vec\theta| will be O(\epsilon^2\sqrt{N}), where N is the number of parameters in the model. This means that in order to make the given expression be of order epsilon, both d\vec\theta_{approx} and d\vec\theta need components of average order epsilon (which will almost never be the case).
Wouldn’t a more sensible formula be something like
Interesting. When I saw the title of this thread, I assumed you would be asking about why a “two-sided” versus “one-sided” finite difference is used to compute the approximation of the derivative. That’s an interesting question and the answer is discussed here.
I think there is a concrete reason for including some version of the length of one or both of the gradient vectors in the denominator of the error expression. What constitutes a “large” error depends on the sizes of the quantities you are dealing with, right? If you’re measuring the distance from here to the moon, an error of 10 meters is not too bad. If you’re measuring your thumb, that would be a big error, right? So the point is scaling. With your formulation, there is no scaling for the absolute values of the actual elements of the vectors, right?
I would think that any of the following would be valid, but you might have to make some adjustments in your threshold for success if you picked the third choice:
E = \displaystyle \frac {||grad - gradapprox||}{||grad|| + ||gradapprox||}
E = \displaystyle \frac {||grad - gradapprox||}{2 * ||grad||}
E = \displaystyle \frac {||grad - gradapprox||}{||grad||}
To state the point more simply, you want the error to be relative, not absolute. In your formulation it is relative only to the number of entries in the vectors, but it would still be absolute in terms of the magnitudes of the entries of the vectors.
Totally fair. I follow the reasoning about wanting to have some sort of relative error. Assuming that the derivatives of J are of the same order as J (which is usually fine), the formula definitely accomplishes that goal.
However, there’s still the issue of getting the formula to be O(\epsilon). The way it seems to me is that the denominator has only sub-leading order dependence on \epsilon which means that the whole expression should still be O(\epsilon^2) due to the numerator’s dependence.
The lecture doesn’t explicitly say that the result should be O(\epsilon), but it does assert that \epsilon = 10^{-7} and that the ratio given should also be \approx 10^{-7}. This means either 10^{-7} is special for some reason (which seems doubtful) or there is some factor of \epsilon cancelling somewhere.
I’m just trying to figure out how that cancellation happens or if perhaps my analysis is wrong.
Sorry, I don’t understand what the magnitude of J has to do with any of this. It appears nowhere in that formula. And even if it did, I don’t understand the mathematical point you’re making there. Suppose I have a function f(x) which is differentiable for all x. Then g(x) = f(x) + b for an arbitrary constant b has the same derivative, right?
On the actual question you’re asking here, I’ll have to go dust off my Numerical Analysis textbook and think a little harder. Stay tuned.
The one other general point worth making here is that even our evaluation of the analytic formulas for the gradients are approximations in practice, since we’re forced to compute them in a finite floating point representation. You can consult the IEEE 754 specs and see that the resolution of binary32 is O(10^{-7}) in the mantissa and for binary64 it’s O(10^{-16}) or thereabouts. The former number there may have something to do with the choice of \epsilon = 10^{-7}.
But before we really go down this rabbit hole, please note I don’t think anybody really directly does Gradient Checking anymore. You just use TensorFlow or PyTorch or your favorite ML platform and the back prop logic is done for you using some form of “automatic differentiation” typically, other than for specific functions (e.g. activations) for which the analytic derivatives are known. If you want to be a researcher and (e.g.) improve the way automatic differentiation is done, then maybe it’s worth digging deeper on this type of topic in general. But for most people this is not worth investing the effort to understand the level of question you are asking.
Prof Ng has pedagogical reasons for introducing us to these ideas, but we most likely won’t need to directly apply them ourselves.
Interesting! I most certainly will stay tuned, and I hope you can find a good answer in your reference.
Just to address your question about the magnitude of derivatives…
I am far from an expert on this and I may have misspoken, but here was what I was going for: In perturbation theory, one usually assumes that a function and its derivatives have the same order of magnitude relative to the perturbation parameter. This is obviously not always the case, but a lot of common functions seem to have this property for at least low-order derivatives in at least a portion of their domain. For this problem in its best-case scenario, the Lagrange error term in the numerator is proportional to J''' while the denominator is proportional to J'. So, more precisely, I should have said that those two derivatives are of the same order of magnitude, but I was being a little loose. I am thinking that with appropriate weight initializations both of those derivatives ought to be in the ballpark of O(1) (or at least the same order overall), making them largely irrelevant for estimating the magnitude of this quotient.
Of course, if there’s a good reason we shouldn’t assume that J''' and J' are of the same order of magnitude, then I am confused again (although it might explain the missing factor of \epsilon).
And thanks for the reality check. I’m really just interested in this because it’s the first thing I’ve encountered in the course that I couldn’t make sense of mathematically, and that’s bugging me.
EDIT:
I’ve played around in the gradient checking assignment and gathered some data, which I will describe here for completeness.
First, dividing by 2 |d\vec\theta| as you proposed does in fact work. Notably, dividing by \sqrt N also works. All three of these forms for the error (i.e. the original version and these two modifications) give roughly the same results. Square rooting or otherwise trying to eliminate a factor of \epsilon from the numerator completely screws up the result.
Second, the error reaches a minimum of E\approx10^{-10} at a value of \epsilon\approx2.2\times10^{-4}. The value of 10^{-7} does indeed appear to be special because it is the value where E\approx\epsilon.
I’m not sure what to make of all this, but the situation is certainly more complicated than I had considered.
Sorry, I bookmarked this thread with the intention of getting back to it, but then things got busy in my real life.
I think the fundamental problem here is that we aren’t dealing with physical objects dancing around in 4D SpaceTime to the tune of Newton’s Laws. We’re doing pure math here and dealing with the gradients of manifolds in extremely high dimensional vector spaces. Actually the case here is a very small one compared to most neural networks. The dimension of the vector space is the total number of individual parameters, right? If you check the test case here, it happens that they have a total of 47 parameters. So I think the reason that \sqrt{N} happened to work here is that:
\sqrt{47} = 6.855...
And it also turns out that in the example here:
||grad|| = 3.385...
Here’s my output for that test case with the correct gradient code and with a bit of instrumentation added:
So it looks like a pure accident that \sqrt{N} seemed plausible.
We’ve got 47 input dimensions and one output dimension, so we’re drawing a surface in 48 dimensional space and then computing gradients of it. Just as if we had x and y as inputs and graphed J as z, ending up with a 3D plot.
In more typical Neural Networks, it’s not at all unusual to have thousands or even millions of parameters. They say that the latest GPT4 has 1.7 trillion parameters. What does a manifold look like in \mathbb{R}^{10000} without going into the realm of the completely crazy? I have no idea, but I think it’s better to toss any intuitions we have based on the normal physics that we are used to dealing with.
I think it all comes down to the fact that we’re working in floating point here and the resolution of 32 bit floats is O(10^{-7}), so everything we are doing should have errors on that order if we scale things appropriately. So that’s why they divide by (essentially) 2 * ||grad||. Now in reality we’re actually doing 64 bit floats, so the correct code does better than that.
The other thing to consider here is the test cases. Notice that what they do is just use np.random.randn “as is” to generate all the inputs, which means we’re seeing parameter values from a Gaussian distribution with \mu = 0 and \sigma = 1. Is that representative of real situations? Typically initialization algorithms start with smaller values. Also notice in the lectures that Prof Ng recommends doing Gradient Checking early in training (not as part of training, but checking the trained results) and after running training for a while, so that things settle down a bit. Well, if your backprop is wrong, who knows if “settling” will actually happen.
I generalized the test code a little like this, so that we can experiment with the magnitudes of the inputs:
I am a little confused by your response, though. I understand that a bad initialization can lead to a bad result, but I don’t understand how that relates to the issue of the error.
It’s still unclear to me why the error takes on values that it does as \epsilon varies. I am willing to believe that it is due entirely to machine precision, but that’s a different problem than having a bad initialization (which is also affected by machine precision).
We’re not actually trying to run the training here, so my point was not whether the initializations are good or bad from that point of view. I was just trying to reason about the properties of the function that we have defined here to do gradient checking. So far it doesn’t perform the way I would have hoped. I have not yet gotten to modifying the value of \epsilon to see what effect that has. That was going to be my next set of experiments: to see if modifying the \epsilon can generate better results with larger inputs. In other words my guess is that there is a coupling between the magnitude of the inputs and that of \epsilon, although that’s a bit unfortunate meaning it’s yet another thing you have to tune.
Of course as I commented earlier, this whole exercise is not really that useful, since we don’t actually do this in practice. But it does make me curious about how they make automatic differentiation work.
Thank you for all the work you have put in so far. This is obviously a very low-priority question, but if you continue working on it and find out more interesting information, I will be interested to see those results.