Gradient Checking Normalization

Why do we need to divide ||dθapprox - dθ|| by the sum of the lengths of these two vectors, ||dθapprox|| + ||dθ||? What happens if we don't?

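This is done so the check works for both small and large values. Without the division, the raw difference depends on the scale of the gradients, so no single threshold would make sense.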

For example, assume we want to measure the difference between d1 and d2 (dThetaApprox and dTheta here). The checks below use squared values for simplicity, but the idea is the same.

Case 1
d1 = 10
d2 = 110
The difference here is 100, but d2 is 11 times d1. A huge relative difference.
Check = ((110-10)^2)/(110^2+10^2) = 0.82

Case 2
d1 = 1010
d2 = 1110
The difference is the same, 100, but relative to the scale of the values it is much smaller.
Check = ((1110-1010)^2)/(1110^2+1010^2) = 0.004
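
The arithmetic in the two cases can be reproduced with a short Python sketch. The function name `squared_check` is made up for illustration, and it uses the squared form from the cases above rather than the norm ratio from the question.

```python
def squared_check(d1, d2):
    """Squared difference divided by the sum of squares, as in the two cases."""
    return (d2 - d1) ** 2 / (d2 ** 2 + d1 ** 2)

case1 = squared_check(10, 110)      # 10000 / 12200
case2 = squared_check(1010, 1110)   # 10000 / 2252200

print(round(case1, 2))  # 0.82
print(round(case2, 3))  # 0.004
```

The raw difference is 100 in both cases, yet the scaled check differs by more than two orders of magnitude.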

The denominator is just a scaling factor that takes the actual magnitude of the values into account. It scales the difference so that we can make sense of it and compare two different checks even when the values are orders of magnitude apart.

It can be roughly thought of as a percentage difference.
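
As a rough illustration of the percentage reading, here is a minimal sketch of the norm-based ratio from the question, ||dθapprox - dθ|| / (||dθapprox|| + ||dθ||), written in plain Python. The name `relative_difference` and the list-based inputs are assumptions for this example, not part of any course code.

```python
import math

def relative_difference(approx, exact):
    """||approx - exact|| / (||approx|| + ||exact||) with Euclidean norms.

    Assumes the two vectors have equal length and are not both all zeros
    (otherwise the denominator is zero).
    """
    diff = math.sqrt(sum((a - e) ** 2 for a, e in zip(approx, exact)))
    denom = (math.sqrt(sum(a * a for a in approx))
             + math.sqrt(sum(e * e for e in exact)))
    return diff / denom

small = relative_difference([10.0], [110.0])      # 100 / 120
large = relative_difference([1010.0], [1110.0])   # 100 / 2120

print(f"{small:.1%}")  # 83.3%
print(f"{large:.1%}")  # 4.7%
```

Multiplying both vectors by the same constant leaves the ratio unchanged, which is why one threshold can be used regardless of how large the gradients are.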
