Why is epsilon applied to the parameters instead of X in gradient_check_n?

Pierre-Yves_Langlois · October 27, 2023, 2:05pm

In the programming assignment “Gradient Checking”, the function gradient_check_n applies the epsilon perturbation to the parameters W and b.

My understanding is that epsilon should be applied to x: y+ = f(x + epsilon) and y- = f(x - epsilon).

If f(x) = wx + b and theta = x + epsilon, then f(theta) should be: wx + w*epsilon + b

But in gradient_check_n, the resulting expression looks like (w + epsilon)*x + (b + epsilon), which is very different.

Above all, w can be any real number (very large or very small), unrelated to the scale of x. How do we choose an epsilon that gives a meaningful delta at the scale of x?

paulinpaloalto · October 27, 2023, 3:28pm

In that formulation x simply stands for whatever the input to the function is. Our ML functions have a number of inputs: the data (X and Y) and the parameters (W and b). In training we keep X and Y as they are and try to “solve” for better values of the parameters W and b. The point of gradient checking is to make sure we have correctly implemented the derivatives (gradients) of the cost with respect to those parameters, so that our training efforts have a chance to work. Variable names are always just an arbitrary choice or a “convention”, but in this case Prof Ng chooses to use \theta to represent the concatenation of all the W^{[l]} and b^{[l]} values for the purposes of implementing the gradient checking algorithm. The check needs to vary every one of those values, and doing that layer by layer on the individual W^{[l]} and b^{[l]} arrays would get pretty complicated, which is why they are flattened into a single vector.
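To make that concrete for the one-dimensional example in the question, here is a sketch of what the check estimates, assuming a cost J(\theta) evaluated on fixed data and a parameter vector \theta with n entries (the symbols n and i are my notation, not the assignment's):

$$
\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots, \theta_n) - J(\theta_1, \dots, \theta_i - \varepsilon, \dots, \theta_n)}{2\varepsilon}
$$

Only one entry of \theta is nudged at a time, so w and b are never shifted together in the same evaluation. For f = wx + b with \theta = (w, b), this recovers \partial f / \partial w = x and \partial f / \partial b = 1, which are the quantities gradient descent uses. Perturbing x instead would estimate \partial f / \partial x = w, which the parameter update never needs.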

paulinpaloalto · October 27, 2023, 3:34pm

Sorry, I didn’t address the question about scale in my first response. You’re right that the scale of the individual W and b values can potentially vary quite a bit. But for our purposes here, we are just trying to determine the correctness of our gradients, which is (at least at a pure math level) independent of the actual values of the individual elements. Maybe the more relevant point is that we are doing this before we even run the training, so we can just randomly initialize all the W and b values from a normal or uniform distribution with small(ish) values of the same scale. If you are worried about the scale of the training data values, we could also do the same random initialization with consistent scale for the X and Y values, since at this point we only care about the correctness of the algorithm. We don’t need to know what the “correct” W and b values really are at this stage: we’ll learn that once we’re sure our algorithm is good and we can successfully run the training with the real training data.
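Here is a minimal sketch of that idea in plain Python, not the assignment's actual code. It assumes a toy linear model w*x + b with a squared-error cost, made-up X and Y values, and hypothetical helper names (`cost`, `backward`, `numeric_grad`, `relative_difference`). The parameters are drawn at random with a small scale, the data stay fixed, and the final comparison divides by the norms of both gradients so the result does not depend on how large the gradients happen to be.

```python
import math
import random

def cost(theta, X, Y):
    """Half mean squared error of the toy model w*x + b, with theta = [w, b]."""
    w, b = theta
    return sum((w * x + b - y) ** 2 for x, y in zip(X, Y)) / (2 * len(X))

def backward(theta, X, Y):
    """Hand-derived gradient of cost with respect to theta = [w, b]."""
    w, b = theta
    m = len(X)
    errors = [w * x + b - y for x, y in zip(X, Y)]
    dw = sum(e * x for e, x in zip(errors, X)) / m
    db = sum(errors) / m
    return [dw, db]

def numeric_grad(cost_fn, theta, epsilon=1e-7):
    """Two-sided difference: nudge one entry of theta at a time."""
    approx = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_minus = list(theta)
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        approx.append((cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon))
    return approx

def relative_difference(grad, approx):
    """||grad - approx|| / (||grad|| + ||approx||), unchanged if both are rescaled."""
    norm = lambda v: math.sqrt(sum(t * t for t in v))
    return norm([g - a for g, a in zip(grad, approx)]) / (norm(grad) + norm(approx))

# Made-up data; X and Y are held fixed during the whole check.
X = [0.5, -1.2, 2.0, 0.3]
Y = [1.0, 0.0, 1.0, 0.0]

# Small random parameters, as one would use before training starts.
random.seed(0)
theta = [random.gauss(0, 1) * 0.01 for _ in range(2)]

grad = backward(theta, X, Y)
approx = numeric_grad(lambda t: cost(t, X, Y), theta)
difference = relative_difference(grad, approx)

# A deliberately wrong gradient (dw doubled) to show what a failure looks like.
buggy_difference = relative_difference([2 * grad[0], grad[1]], approx)

print("correct gradient:", difference)
print("buggy gradient:  ", buggy_difference)
```

With a correct `backward`, `difference` should come out tiny, while the doubled dw makes `buggy_difference` many orders of magnitude larger. The epsilon of 1e-7 here is just a choice for this sketch; the cutoff for what counts as "small enough" is also something you would pick for your own model.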
