Gradient Check Error Threshold - Theory

In both the course lecture and the exercises, we see an acceptable grad error threshold of roughly 10^-7, which is on the same order as epsilon. This can’t be a coincidence, and the lecture doesn’t give much of a concrete explanation of why that error threshold is acceptable.

Can someone share some theory here to help me build an intuition for how to gauge acceptable grad error thresholds in future cases where epsilon will not always be 10^-7?

Thank you!

Hi JuDas,

Welcome to the community!

The value 10^-7 that you came across in the lecture is just a commonly used choice. The point is this: you compute the gradient two ways, analytically via back propagation and numerically by perturbing the parameters by a small epsilon, and then compare the two. If the difference is below the threshold (on the order of 10^-7), the back propagation implementation is considered correct; if it isn’t, you need to go back through the implementation and fix it.
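To make that comparison concrete, here is a minimal NumPy sketch of the kind of check being described (my own illustration, not copied from the course notebook): `f` is the cost function, `grad_f` is the gradient produced by back propagation, and the returned relative difference is what gets compared against the threshold.

```python
import numpy as np

def gradient_check(f, grad_f, theta, epsilon=1e-7):
    """Compare the backprop gradient grad_f(theta) against a two-sided
    numerical estimate built by nudging each parameter by +/- epsilon."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad_approx[i] = (f(theta_plus) - f(theta_minus)) / (2 * epsilon)

    grad = grad_f(theta)
    # Relative difference: ||grad - grad_approx|| / (||grad|| + ||grad_approx||)
    return np.linalg.norm(grad - grad_approx) / (
        np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    )

# Toy example: quadratic cost with a known gradient
diff = gradient_check(lambda t: np.sum(t ** 2), lambda t: 2 * t, np.array([1.0, -2.0, 3.0]))
print(diff)  # far below a 2e-7 threshold, so this "backprop" would pass the check
```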

Here’s a link from a web search that should help make clear how and in what cases this epsilon value is used.

That article explains how gradient checking works, but it doesn’t really address the question of how to choose a reasonable value for \epsilon, or the criterion for what value of the two-sided difference indicates that you need to look for bugs in your back propagation calculations.

I think the high level point is that the choice of the \epsilon value is somewhat arbitrary. Notice that what they do in the notebook is that, having selected 10^{-7} as the value of \epsilon, they then use 2\epsilon as the error threshold. I have not looked for any papers on gradient checking, but just thinking about it from basic calculus principles, you would need to base the choice of \epsilon on some estimate of how “crazy” the behavior of the function is in a small region. The problem is that our intuitions are mostly based on the relatively smooth and well behaved functions we see in physics, while the full spectrum of what is mathematically possible can get pretty wild: a function can cover a large range of values over a very small domain. But it does make sense that the error threshold is within the same order of magnitude as the \epsilon value you choose; e.g. 5\epsilon is probably just as good a threshold as 2\epsilon.
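To add a bit of numerical flavor to the “how crazy is the function” point: even for a perfectly smooth function there is a trade-off in the choice of \epsilon. The truncation error of the two-sided difference shrinks like \epsilon^2, but floating point cancellation in the numerator grows like 1/\epsilon, so both very large and very small perturbations hurt. Here is a throwaway experiment (again my own, not from the notebook) on f(x) = \sin(x) at x = 1:

```python
import numpy as np

f, x0 = np.sin, 1.0
exact = np.cos(x0)  # true derivative of sin at x0

for eps in [1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11]:
    approx = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
    print(f"epsilon = {eps:.0e}   |error| = {abs(approx - exact):.2e}")
```

For this smooth example the error around \epsilon = 10^{-7} comes out a couple of orders of magnitude below \epsilon itself, so a 2\epsilon (or 5\epsilon) threshold leaves headroom for the numerical noise of a correct implementation while still catching real bugs, which produce differences orders of magnitude larger.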

One way to dig deeper on this whole question would be to look into the \epsilon values used by TensorFlow or PyTorch in their gradient checking utilities. To be precise, the gradient tape mechanism in TF and autograd in torch compute gradients analytically via the chain rule rather than by finite differences, but both frameworks also ship numerical checkers that do the same kind of two-sided finite differences we are using for Gradient Checking here, with their own default \epsilon values. I’ll see if I can find any info on that. Stay tuned!
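As one pointer in that direction: PyTorch’s torch.autograd.gradcheck compares autograd’s analytic gradient against a finite difference estimate, and if I remember the defaults correctly its perturbation eps is on the order of 10^{-6}, with separate absolute and relative tolerances. A quick sketch of how you would call it on a toy function (my own example, just to show the knobs):

```python
import torch
from torch.autograd import gradcheck

# Toy function of a double-precision input (gradcheck wants float64)
x = torch.randn(5, dtype=torch.double, requires_grad=True)

# Compares autograd's analytic gradient against a two-sided finite difference
ok = gradcheck(lambda t: (t ** 3).sum(), (x,), eps=1e-6, atol=1e-5, rtol=1e-3)
print(ok)  # True if the two gradients agree within the given tolerances
```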

Thank you for your input, Paul sir!

Looking forward to seeing more of your valuable comments on this query. This question seemed quite interesting to me, and I was just trying to give it my best shot :slight_smile: