When to use Gradient Checking?

I understand how gradient checking works and what it is used for, but I am not clear on what types of bugs would necessitate grad checking other than an incorrect implementation of the mathematical formulas for the gradients. Would the model even converge in such a case? And if it does converge, how would one even know that a grad check should be used at all?
Any other types of bugs or scenarios that would prompt the use of grad check? And how might they manifest themselves?
Thank you!

That is exactly the case that it is intended for.

That’s a good point. It would all depend on the exact nature of the bug(s), I suppose. You could imagine a case in which it converges, but not as quickly as it should because the gradient values were incorrectly computed to be lower than they should have been. But that’s only one particular type of bug that one can imagine. It’s also possible that the incorrect gradients wouldn’t even point in the right directions, so you’d get divergence no matter what learning rate you chose.
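To make the "converges, but slower than it should" case concrete, here is a minimal sketch (the function names are mine, not from the course code) of a gradient check catching an analytic gradient that is too small by a constant factor. Training with this gradient would still move downhill, just at half speed, so you might never notice without the check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def buggy_grad(z):
    # Deliberate bug: the correct derivative is s * (1 - s),
    # but this version is scaled down by 0.5.
    s = sigmoid(z)
    return 0.5 * s * (1 - s)

def numeric_grad(f, z, eps=1e-5):
    # Centered finite difference, the same formula gradient checking uses.
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = 0.7
approx = numeric_grad(sigmoid, z)
analytic = buggy_grad(z)
rel_err = abs(approx - analytic) / max(abs(approx), abs(analytic))
print(rel_err)  # ~0.5, far above the ~1e-7 you expect from a correct gradient
```

A correct implementation typically gives a relative error around 1e-7 with eps = 1e-5; anything near 1e-2 or larger is a strong sign of a bug.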

But the higher level point here is that in “real world” solutions these days, nobody hand-codes their own gradient computations any more: everyone uses a framework like TensorFlow or PyTorch or the like. In those systems the gradients are computed for you using “automatic differentiation”, which is worth distinguishing from the finite-difference idea behind Gradient Checking: autodiff applies the chain rule to the known analytic derivatives of each primitive operation (the frameworks register an analytic gradient for every standard op, including the common activation functions), so the results are exact up to floating point and cost roughly one backward pass. Finite differences, by contrast, only approximate the gradients and require extra forward passes per parameter. So maybe the pedagogical value of this section is to point out that you can approximate the gradients numerically, but it’s much more expensive than using the analytic derivatives, which is why it’s used only as a check rather than for training.
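As a sketch of why finite differences are a check rather than a replacement for analytic gradients: the centered difference has a truncation error that shrinks roughly like eps², until floating-point round-off takes over, and it costs two function evaluations per parameter. A toy example (my own, not from the course) on a function with a known derivative:

```python
import numpy as np

def f(x):
    return np.sin(x)  # exact derivative: cos(x)

def centered_diff(f, x, eps):
    # Two function evaluations per derivative estimate.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 1.0
exact = np.cos(x)
for eps in (1e-2, 1e-4, 1e-6):
    err = abs(centered_diff(f, x, eps) - exact)
    print(eps, err)
# The error drops roughly as eps**2 (about 1e-5 -> 1e-9)
# until round-off in the subtraction starts to dominate.
```

This is also why the course recommends a moderate eps such as 1e-7 rather than something extremely small: past a point, cancellation error in `f(x + eps) - f(x - eps)` makes the estimate worse, not better.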

We’ll get introduced to TF in Week 3 of DLS C2, so stay tuned for that.