Remember that grad and gradapprox are both vectors, right? So what is the absolute value of a vector? It is a vector of absolute values. The 2-norm of a vector is the Euclidean length of the vector, which is a scalar. You want a scalar value for the difference.
You don’t have to take my word for this or wonder what np.abs(grad - gradapprox) gives you. Why don’t you try it and see what you get?
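You don’t even need real data to see it. Here is a minimal sketch with made-up stand-ins for grad and gradapprox, just to show the shapes of the two results:

```python
import numpy as np

# Hypothetical stand-ins for the real vectors, only to illustrate the shapes:
grad = np.array([0.5, -1.2, 3.0])
gradapprox = np.array([0.5001, -1.2002, 2.9999])

print(np.abs(grad - gradapprox))          # a vector of elementwise absolute values
print(np.linalg.norm(grad - gradapprox))  # the 2-norm: a single scalar
```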
Well, I’m talking about the “general case” here. If you mean the 1D case, then either method gives the same answer there. But the 1D case is not really interesting: they only show it to introduce the concepts, and if you’re going to do that, handling the 1D case differently would sort of defeat the pedagogical point.
Thank you for the response! Yes, I was talking about the 1D case.
Yes, np.abs(grad - gradapprox) returns a vector, but what if we use sum(np.abs(grad - gradapprox)) / grad.shape[0]? That way we would get the average gradient difference, right? I know the two methods give different values, but why is the 2-norm method used instead?
When you are dealing with vectors and want a scalar metric, the L2 norm (Euclidean distance) is usually preferred over the L1 norm (the sum of the absolute values of the differences) because of its mathematical properties:

1. It is differentiable, whereas the absolute value is not differentiable at z = 0.
2. It gives much higher penalties to larger errors and much lower penalties to very small errors (<< 1).
Understanding why point 2) is important is clearer if you think about the gradients: the derivative of z^2 is 2z, of course, and the derivative of |z| is either -1 or 1, depending on whether z < 0 or z > 0. So in the L2 case, the gradient gives a stronger “push” towards the right answer the farther away you are, whereas L1 doesn’t really care and gives the same size of “push” no matter how big or small the error is.
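To make that concrete, here is a toy scalar version of the comparison (the values of z are arbitrary):

```python
import numpy as np

# A toy illustration of point 2): the L2 "push" (derivative of z^2) scales with
# the size of the error, while the L1 "push" (derivative of |z|) is the same
# for any nonzero z.
for z in [0.01, 0.1, 1.0, 10.0]:
    l2_push = 2 * z          # d/dz of z^2
    l1_push = np.sign(z)     # d/dz of |z| for z != 0
    print(f"z = {z:6.2f} -> L2 push = {l2_push:6.2f}, L1 push = {l1_push:4.1f}")
```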
Interesting! But why do we divide by the sum of the norms and not just by the norm of the “reference” exact value “grad”? I used to always divide by the “reference”. Here is what I mean: Approximation error - Wikipedia
It is applied to the L1 norm there, but the idea is the same.
The point of the division is to take the scale of the values into account. The absolute difference by itself doesn’t tell you anything, right? If the difference has length 1 and each of the vectors has length > 10^6, then that’s a pretty small error. But what if the absolute difference is 1 and the vectors are each of length 0.75? The reason you take the sum is that the lengths of both vectors matter. Think of it as the relative error.
With ε more thought, the other point is that if you want to do a more pure version of approximation error, how do you know which to use as the reference value? You can’t use grad, right? The whole point of this algorithm is that we can’t assume it is correct. I guess you could use gradapprox as the reference value and arguably that would be just as legitimate and perhaps closer to Approximation Error. I’m not sure why they have chosen the sum, but that seems to be the usual way of computing the success criterion in gradient checking.
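To put that in code, here is a sketch of that sum-of-norms criterion (the function name gradient_check_difference is just mine, and the example vectors are made up):

```python
import numpy as np

def gradient_check_difference(grad, gradapprox):
    # 2-norm of the difference, scaled by the sum of the 2-norms of both vectors
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

# Made-up vectors showing why the scaling matters: the same absolute difference
# is tiny relative to huge gradients, but large relative to small ones.
big = np.array([1e6, -2e6, 3e6])
print(gradient_check_difference(big, big + 1.0))        # ~2e-7: negligible
print(gradient_check_difference(np.array([0.5, 0.5]),
                                np.array([1.0, 1.2])))  # ~0.38: a real problem
```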
Thanks! Now I see. For me, the “relative difference” was purely a “relative change” calculation, but “relative change” and “relative difference” aren’t the same. If we calculate the relative difference, we simply do not know whether “grad” or “gradapprox” is the exact reference value. By contrast, with a relative change we have to take one of the magnitudes as the reference, because otherwise there would be no original magnitude to measure the “distance” from.
I found this Wikipedia article very insightful!
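Here is a quick 1D sketch of that distinction, with function names of my own just for illustration:

```python
def relative_change(x, x_ref):
    # Asymmetric: x_ref must be designated as the exact/reference value.
    return abs(x - x_ref) / abs(x_ref)

def relative_difference(x, y):
    # Symmetric: neither value is privileged (sum of magnitudes as the scale,
    # like the denominator in the gradient check above).
    return abs(x - y) / (abs(x) + abs(y))

print(relative_change(1.1, 1.0), relative_change(1.0, 1.1))          # different: order matters
print(relative_difference(1.1, 1.0), relative_difference(1.0, 1.1))  # identical either way
```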
Thanks for the link, which gives a clear statement of the distinction between relative change and relative difference! Having seen that explanation, it makes sense that they chose relative difference for their purpose here.