In the second slide from gradient checking (p.38 of the entire C2_W1), the instructor normalizes the
L2 norm to keep the numerator from being too large. He divides it by the sum of the two vectors' L2 norms.
How could this operation normalize the result? Is there any mathematical theory to support or explain this?
Yes, but it's nothing deep or subtle. The point is to "scale" the error, i.e., turn it into a relative error. Suppose the length of the difference vector (the numerator of that expression) is 0.5. How do you know whether that is a big error or not? If the actual correct vector you are trying to approximate has a length of 10^6, then 0.5 is a pretty small error. On the other hand, if the norm of the actual vector is 2, then 0.5 is a pretty big error.
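Here is a minimal numpy sketch of that idea. The function and variable names (relative_difference, grad, grad_approx, etc.) are mine for illustration, not the course's notation, but the formula is the one on the slide: the L2 norm of the difference divided by the sum of the two vectors' L2 norms. The two print statements reproduce the example above: the same absolute error of 0.5 looks tiny next to a vector of norm 10^6 and sizeable next to a vector of norm 2.

```python
import numpy as np

def relative_difference(grad_approx, grad):
    # ||grad_approx - grad||_2 scaled by ||grad_approx||_2 + ||grad||_2
    numerator = np.linalg.norm(grad_approx - grad)
    denominator = np.linalg.norm(grad_approx) + np.linalg.norm(grad)
    return numerator / denominator

# Same absolute error (numerator ~ 0.5), very different verdicts:
big = np.full(1000, 1e6 / np.sqrt(1000))    # vector with L2 norm ~ 1e6
small = np.full(4, 1.0)                     # vector with L2 norm = 2
noise = np.full(1000, 0.5 / np.sqrt(1000))  # perturbation with L2 norm = 0.5

print(relative_difference(big + noise, big))     # ~2.5e-7 -> tiny relative error
print(relative_difference(small + 0.25, small))  # ~0.11   -> sizeable relative error
```

As for why the denominator is the sum of both norms rather than just one of them: my understanding is that this keeps the expression well-behaved even when one of the two vectors happens to be close to zero, and it keeps the result symmetric in the two arguments.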