The idea behind gradient checking is to estimate the gradients numerically using finite differences: you make a small perturbation to each parameter individually and observe how the cost function changes.
But what I don’t understand is the relationship between analytical gradients and numerically approximated gradients: how exactly does comparing the analytical gradients (from backpropagation) with the numerically approximated gradients (from finite differences) help identify whether backprop is correct?
I want to know what’s happening in depth.

The purpose of gradient checking is to verify that your analytical gradients are correct. And even if you have them correct from a mathematical p.o.v., you still need to write the code correctly as well. So how could you go about verifying that in a “mechanical” way, as opposed to just studying the code really carefully? Gradient checking compares the results of the “finite difference” approximations to the output of your backprop code. If your code is correct, then the two values should be approximately the same. There is a little thought needed to figure out what an appropriate error threshold is, since everything we are doing here is an approximation by definition: the formulas may be “pure math”, but our code can only deal with floating point numbers, which are a finite representation.
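To make the comparison concrete, here is a minimal sketch for a hand-derived logistic-regression gradient. The model, the function names (`cost`, `analytic_grad`, `numeric_grad`), and the data are all made up for illustration; the point is the final relative-error comparison between the calculus-based gradient and the centered finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    # Binary cross-entropy cost for logistic regression
    a = sigmoid(X @ w)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def analytic_grad(w, X, y):
    # The gradient derived by hand (the "backprop" result for this tiny model)
    a = sigmoid(X @ w)
    return X.T @ (a - y) / len(y)

def numeric_grad(w, X, y, eps=1e-7):
    # Centered finite differences: perturb each parameter individually
    g = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (cost(w_plus, X, y) - cost(w_minus, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
w = rng.normal(size=3)

ga = analytic_grad(w, X, y)
gn = numeric_grad(w, X, y)
# Relative error between the two gradients; with correct code this is
# typically around 1e-7 or smaller, which is the "approximately the same"
# threshold discussed above
rel_err = np.linalg.norm(ga - gn) / (np.linalg.norm(ga) + np.linalg.norm(gn))
print(rel_err)
```

If you introduce a bug in `analytic_grad` (say, dropping the division by `len(y)`), the relative error jumps to order 1, which is exactly how gradient checking flags incorrect backprop code.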

Note that this method is introduced more from a pedagogical point of view than a practical one: no one really does this “by hand” anymore. The state of the art is way too complex for anyone to implement all their own code these days: you just use one of the “frameworks” like TensorFlow (Prof Ng’s choice here in DLS) or PyTorch or Caffe or one of many other choices. Within those frameworks, they have implemented the analytic derivatives for known functions like the activation functions and loss functions and handle the rest using “automatic differentiation”, which applies the chain rule mechanically through the computational graph to produce exact derivatives. That is different from the “finite difference” approximation used here, which serves only as an independent check on hand-written gradient code.

The gradients are the change in cost vs the change in weights.
That’s dJ / dw.
Calculus gives you the exact values. That’s what backpropagation calculates.
You can approximate dJ / dw by using small changes in w, and computing the change in J, without using the gradients directly.
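A hypothetical one-parameter example makes this explicit. Take J(w) = w², so calculus (and backprop) gives dJ/dw = 2w exactly; the centered-difference formula recovers nearly the same number using only evaluations of J:

```python
# Hypothetical single-parameter cost: J(w) = w**2, so dJ/dw = 2w exactly
def J(w):
    return w ** 2

w = 3.0
eps = 1e-6
# Finite-difference estimate: small change in J over small change in w
approx = (J(w + eps) - J(w - eps)) / (2 * eps)
# Exact value from calculus, i.e. what backpropagation computes
exact = 2 * w
print(approx, exact)
```

The two values agree up to floating-point error, which is why a large discrepancy between them points to a bug in the analytic gradient rather than in the approximation.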