Why is epsilon applied to the parameters instead of X in gradient_check_n?

In the programming assignment “Gradient Checking” in the function gradient_check_n, the epsilon delta is computed on the parameters W and b.

My understanding is that epsilon should be applied to x: y+ = f(x+epsilon) and y- = f(x-epsilon).

If f(x)= wx+b and theta=x+epsilon, the resultant equation should be: wx + w*epsilon + b

But in gradient_check_n, the resulting expression looks like: (w+epsilon)*x + (b+epsilon), which is very different.

Above all, w can be any real number (very big or very small), unrelated to the scale of x. How do we choose an epsilon that gives a delta that is meaningful at the scale of x?

In that formulation x simply stands for whatever the input to the function is. Our ML functions have a number of inputs: the data (X and Y) and the parameters (W and b). What we are doing in training is keeping X and Y as they are and trying to “solve” for better values of the parameters W and b, so the function we differentiate is the cost viewed as a function of the parameters. The point of gradient checking is to make sure we have correctly implemented the derivatives (gradients) of the cost with respect to the parameters, so that our training efforts have a chance to work. Variable names are always just an arbitrary choice or a “convention”, but in this case Prof Ng chooses to use \theta to represent the concatenation of all the W^{[l]} and b^{[l]} values for the purposes of implementing the gradient checking algorithm. The algorithm needs to vary every one of them, and doing that with the individual per-layer values would get pretty complicated.
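To make that concrete, here is a minimal sketch of the idea for the one-variable model f(x) = wx + b from the question. It is not the assignment's code: the squared-error cost and the helper names (`cost`, `analytic_grad`, `numeric_grad`, `relative_difference`) are assumptions chosen for illustration. Notice that X and Y never change; only one entry of theta is nudged at a time.

```python
import numpy as np

def cost(theta, X, Y):
    # theta = [w, b] plays the role of the "x" in f(x + epsilon);
    # the data X and Y are held fixed.
    w, b = theta
    return 0.5 * np.mean((w * X + b - Y) ** 2)

def analytic_grad(theta, X, Y):
    # Hand-derived dJ/dw and dJ/db: the thing we want to verify.
    w, b = theta
    err = w * X + b - Y
    return np.array([np.mean(err * X), np.mean(err)])

def numeric_grad(theta, X, Y, epsilon=1e-7):
    # Two-sided difference, perturbing one parameter at a time.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad[i] = (cost(theta_plus, X, Y) - cost(theta_minus, X, Y)) / (2 * epsilon)
    return grad

def relative_difference(a, b):
    # Normalized distance between the two gradient vectors.
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))

rng = np.random.default_rng(0)
X = rng.normal(size=50)      # synthetic data, just for the check
Y = rng.normal(size=50)
theta = rng.normal(size=2)   # [w, b]

diff = relative_difference(analytic_grad(theta, X, Y), numeric_grad(theta, X, Y))
print(diff)  # should be tiny if analytic_grad is correct
```

Perturbing X instead would estimate dJ/dX, which is a perfectly valid derivative but not the one gradient descent uses to update W and b.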

Sorry, I didn’t address that question in my first response. You’re right that the scale of the individual W and b values can potentially vary quite a bit. But for our purposes here, we are just trying to determine the correctness of our gradients, which is (at least at a pure math level) independent of the actual values of the individual elements. Maybe the more relevant point is that we are doing this before we even run the training, so we can just randomly initialize all the W and b values from a normal or uniform distribution with small(ish) values of the same scale. If you are worried about the scale of the training data values, we could also do the same random initialization with consistent scale for the X and Y values, since at this point we only care about the correctness of the algorithm. We don’t need to know what the “correct” W and b values really are at this stage: we’ll learn that once we’re sure our algorithm is good and we can successfully run the training with the real training data.
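As a rough illustration of that workflow, the sketch below runs the check on small random parameters and synthetic data, once with a correct gradient and once with a deliberately broken one. Everything here is an assumption made for the example (a single linear layer, a squared-error cost, the names `grad_correct`, `grad_buggy` and `check`); it is not the assignment's implementation. The result is reported as a relative difference, so it is normalized by the size of the gradients themselves rather than tied to the raw scale of W, b or X.

```python
import numpy as np

def cost(theta, X, Y):
    # theta packs W (first n entries) and b (last entry) into one vector.
    W, b = theta[:-1], theta[-1]
    return 0.5 * np.mean((W @ X + b - Y) ** 2)

def grad_correct(theta, X, Y):
    W, b = theta[:-1], theta[-1]
    err = W @ X + b - Y                      # shape (m,)
    return np.append(X @ err / err.size, np.mean(err))

def grad_buggy(theta, X, Y):
    W, b = theta[:-1], theta[-1]
    err = W @ X + b - Y
    return np.append(X @ err, np.mean(err))  # deliberate bug: missing 1/m on dW

def check(grad_fn, theta, X, Y, epsilon=1e-7):
    # Two-sided numerical gradient, one parameter at a time.
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = epsilon
        approx[i] = (cost(theta + step, X, Y) - cost(theta - step, X, Y)) / (2 * epsilon)
    g = grad_fn(theta, X, Y)
    return np.linalg.norm(g - approx) / (np.linalg.norm(g) + np.linalg.norm(approx))

rng = np.random.default_rng(1)
n, m = 3, 20
X = rng.normal(size=(n, m))             # synthetic inputs with a consistent scale
Y = rng.normal(size=m)                  # synthetic targets
theta = 0.01 * rng.normal(size=n + 1)   # small random initialization of W and b

diff_ok = check(grad_correct, theta, X, Y)
diff_bug = check(grad_buggy, theta, X, Y)
print(diff_ok, diff_bug)  # the buggy version should stand out clearly
```

The particular parameter values do not matter for this purpose; any reasonable random draw should expose the missing 1/m factor.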