Remember that grad and gradapprox are both vectors, right? So what is the absolute value of a vector? It is a vector of absolute values. The 2-norm of a vector is the Euclidean length of the vector, which is a scalar. You want a scalar value for the difference.
You don’t have to take my word for this or wonder what np.abs(grad - gradapprox) gives you. Why don’t you try it and see what you get?
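You don’t even need real data to see it. Here is a minimal sketch with made-up stand-ins for grad and gradapprox, just to show the shapes of the two results:

```python
import numpy as np

# Hypothetical stand-ins for the real vectors, only to illustrate the shapes:
grad = np.array([0.5, -1.2, 3.0])
gradapprox = np.array([0.5001, -1.2002, 2.9999])

print(np.abs(grad - gradapprox))          # a vector of elementwise absolute values
print(np.linalg.norm(grad - gradapprox))  # the 2-norm: a single scalar
```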
Well, I’m talking about the “general case” here. If you mean the 1D case, then either method gives the same answer there. But the 1D case is not really interesting: they only show it to introduce the concepts, and if you’re going to do that, handling the 1D case differently would sort of defeat the pedagogical point.
Thank you for the response! Yes, I was talking about the 1D case.
Yes, np.abs(grad - gradapprox) returns a vector, but what if we use sum(np.abs(grad - gradapprox)) / grad.shape[0]? That way we would get the average gradient difference, right? I know the two methods give different values, but why is the 2-norm method used instead?
When you are dealing with vectors and want a scalar metric, the L2 norm (Euclidean distance) is usually preferred over the L1 norm (the sum of the absolute values of the differences) because of its mathematical properties:

1. It is differentiable, whereas the absolute value is not differentiable at z = 0.
2. It gives much higher penalties to larger errors and much lower penalties to very small errors (<< 1).
Understanding why point 2) is important is clearer if you think about the gradients: the derivative of z^2 is 2z, of course, and the derivative of |z| is either -1 or 1, depending on whether z < 0 or z > 0. So in the L2 case, the gradient gives a stronger “push” towards the right answer the farther away you are, whereas L1 doesn’t really care and gives the same size of “push” no matter how big or small the error is.
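To make that concrete, here is a toy scalar version of the comparison (the values of z are arbitrary):

```python
import numpy as np

# A toy illustration of point 2): the L2 "push" (derivative of z^2) scales with
# the size of the error, while the L1 "push" (derivative of |z|) is the same
# for any nonzero z.
for z in [0.01, 0.1, 1.0, 10.0]:
    l2_push = 2 * z          # d/dz of z^2
    l1_push = np.sign(z)     # d/dz of |z| for z != 0
    print(f"z = {z:6.2f} -> L2 push = {l2_push:6.2f}, L1 push = {l1_push:4.1f}")
```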
Interesting! But why do we divide by the sum of the norms and not just by the norm of the “reference” exact value “grad”? I used to always divide by the “reference”. Here is what I mean: Approximation error - Wikipedia
It is applied to the L1 norm there, but the idea is the same.
The point of the division is to take the scale of the values into account. The absolute difference by itself doesn’t tell you anything, right? If the difference has length 1 and each of the vectors has length > 10^6, then that’s a pretty small error. But what if the absolute difference is 1 and the vectors are each of length 0.75? The reason you take the sum is that the lengths of both vectors matter. Think of it as the relative error.
With ε more thought, the other point is that if you want to do a more pure version of approximation error, how do you know which to use as the reference value? You can’t use grad, right? The whole point of this algorithm is that we can’t assume it is correct. I guess you could use gradapprox as the reference value and arguably that would be just as legitimate and perhaps closer to Approximation Error. I’m not sure why they have chosen the sum, but that seems to be the usual way of computing the success criterion in gradient checking.
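To put that in code, here is a sketch of that sum-of-norms criterion (the function name gradient_check_difference is just mine, and the example vectors are made up):

```python
import numpy as np

def gradient_check_difference(grad, gradapprox):
    # 2-norm of the difference, scaled by the sum of the 2-norms of both vectors
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

# Made-up vectors showing why the scaling matters: the same absolute difference
# is tiny relative to huge gradients, but large relative to small ones.
big = np.array([1e6, -2e6, 3e6])
print(gradient_check_difference(big, big + 1.0))        # ~2e-7: negligible
print(gradient_check_difference(np.array([0.5, 0.5]),
                                np.array([1.0, 1.2])))  # ~0.38: a real problem
```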
Thanks! Now I see. For me, the “relative difference” was purely a “relative change” calculation, but “relative change” and “relative difference” aren’t the same. If we calculate the relative difference, we simply do not know whether “grad” or “gradapprox” is the exact reference value. By contrast, with a relative change we have to take one of the magnitudes as the reference, because otherwise there would be no original magnitude to measure the “distance” from.
I found this Wikipedia article very insightful!
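Here is a quick 1D sketch of that distinction, with function names of my own just for illustration:

```python
def relative_change(x, x_ref):
    # Asymmetric: x_ref must be designated as the exact/reference value.
    return abs(x - x_ref) / abs(x_ref)

def relative_difference(x, y):
    # Symmetric: neither value is privileged (sum of magnitudes as the scale,
    # like the denominator in the gradient check above).
    return abs(x - y) / (abs(x) + abs(y))

print(relative_change(1.1, 1.0), relative_change(1.0, 1.1))          # different: order matters
print(relative_difference(1.1, 1.0), relative_difference(1.0, 1.1))  # identical either way
```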
Thanks for the link, which gives a clear statement of the distinction between relative change and relative difference! Having seen that explanation, it makes sense that they chose relative difference for their purpose here.