Gradient Check Error Threshold - Theory

In both the course lecture and the exercises, we see an acceptable grad error threshold of roughly 10^-7, which is on the same order as epsilon. This can’t be a coincidence, and the lecture doesn’t give much of a concrete explanation of why that error threshold is acceptable.

Can someone share some theory here to help me build an intuition for how to gauge acceptable grad error thresholds in future cases where epsilon will not always be 10^-7?

Thank you!

Hi JuDas,

Welcome to the community!

The value 10^-7 that you came across in the lecture is just a commonly used choice. The point is this: you compute the gradient two ways, analytically via back propagation and numerically by perturbing the parameters by a small epsilon, and then compare the two. If the difference is below the threshold (on the order of 10^-7), the back propagation implementation is considered correct; if it isn’t, you need to go back through the implementation and fix it.
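To make that comparison concrete, here is a minimal NumPy sketch of the kind of check being described (my own illustration, not copied from the course notebook): `f` is the cost function, `grad_f` is the gradient produced by back propagation, and the returned relative difference is what gets compared against the threshold.

```python
import numpy as np

def gradient_check(f, grad_f, theta, epsilon=1e-7):
    """Compare the backprop gradient grad_f(theta) against a two-sided
    numerical estimate built by nudging each parameter by +/- epsilon."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad_approx[i] = (f(theta_plus) - f(theta_minus)) / (2 * epsilon)

    grad = grad_f(theta)
    # Relative difference: ||grad - grad_approx|| / (||grad|| + ||grad_approx||)
    return np.linalg.norm(grad - grad_approx) / (
        np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    )

# Toy example: quadratic cost with a known gradient
diff = gradient_check(lambda t: np.sum(t ** 2), lambda t: 2 * t, np.array([1.0, -2.0, 3.0]))
print(diff)  # far below a 2e-7 threshold, so this "backprop" would pass the check
```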

Here’s a link from a web search that should help make clear how and in what cases this epsilon value is used.

That article explains how gradient checking works, but it doesn’t really address the question of how to choose a reasonable value for \epsilon, or the criterion for what value of the two-sided difference indicates that you need to look for bugs in your back propagation calculations.

I think the high level point is that the choice of the \epsilon value is somewhat arbitrary. Notice that what they do in the notebook is that, having selected 10^{-7} as the value of \epsilon, they then use 2\epsilon as the error threshold. I have not looked for any papers on gradient checking, but just thinking about it from basic calculus principles, you would need to base the choice of \epsilon on some estimate of how “crazy” the behavior of the function is in a small region. The problem is that our intuitions are mostly based on the relatively smooth and well behaved functions we see in physics, while the full spectrum of what is mathematically possible can get pretty wild: a function can cover a large range of values over a very small domain. But it does make sense that the error threshold is within the same order of magnitude as the \epsilon value you choose; e.g. 5\epsilon is probably just as good a threshold as 2\epsilon.
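To add a bit of numerical flavor to the “how crazy is the function” point: even for a perfectly smooth function there is a trade-off in the choice of \epsilon. The truncation error of the two-sided difference shrinks like \epsilon^2, but floating point cancellation in the numerator grows like 1/\epsilon, so both very large and very small perturbations hurt. Here is a throwaway experiment (again my own, not from the notebook) on f(x) = \sin(x) at x = 1:

```python
import numpy as np

f, x0 = np.sin, 1.0
exact = np.cos(x0)  # true derivative of sin at x0

for eps in [1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11]:
    approx = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
    print(f"epsilon = {eps:.0e}   |error| = {abs(approx - exact):.2e}")
```

For this smooth example the error around \epsilon = 10^{-7} comes out a couple of orders of magnitude below \epsilon itself, so a 2\epsilon (or 5\epsilon) threshold leaves headroom for the numerical noise of a correct implementation while still catching real bugs, which produce differences orders of magnitude larger.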

One way to dig deeper on this whole question would be to look into the \epsilon values used by TensorFlow or PyTorch in their gradient checking utilities. To be precise, the gradient tape mechanism in TF and autograd in torch compute gradients analytically via the chain rule rather than by finite differences, but both frameworks also ship numerical checkers that do the same kind of two-sided finite differences we are using for Gradient Checking here, with their own default \epsilon values. I’ll see if I can find any info on that. Stay tuned!
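As one pointer in that direction: PyTorch’s torch.autograd.gradcheck compares autograd’s analytic gradient against a finite difference estimate, and if I remember the defaults correctly its perturbation eps is on the order of 10^{-6}, with separate absolute and relative tolerances. A quick sketch of how you would call it on a toy function (my own example, just to show the knobs):

```python
import torch
from torch.autograd import gradcheck

# Toy function of a double-precision input (gradcheck wants float64)
x = torch.randn(5, dtype=torch.double, requires_grad=True)

# Compares autograd's analytic gradient against a two-sided finite difference
ok = gradcheck(lambda t: (t ** 3).sum(), (x,), eps=1e-6, atol=1e-5, rtol=1e-3)
print(ok)  # True if the two gradients agree within the given tolerances
```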

Thank you for your input, Paul sir!

Looking forward to seeing more of your valuable comments on this query. This question seemed quite interesting to me, and I was just trying to give it my best shot :slight_smile: