I watched Professor Andrew explain that to implement gradient checking we reshape all the weights, including those of the previous layers, into one vector, and then for each w we add a small value \epsilon to check our gradient.
Now, the previous layers (the layers closer to the input layer) have an activation function applied to them. Suppose, hypothetically, that the activation is z^2, i.e. it squares its input.
So the \epsilon we add to a weight in an early layer will have its effect shrink exponentially as it passes through the deeper layers, and it may become so small by the time it reaches the loss function that we could omit it, as in the formula in the image below.
You are correct.
Gradient checking is really only useful if you’re trying to debug a totally new cost function.
There is rarely a need for this.
Thanks. So is it true that for the first layer the gradient check may make no difference, since the perturbation vanishes on its way from the first layer to the last layer?
I would put it a bit differently. The point is that you are modifying the inputs in every dimension by a small amount. Then you are computing an approximation to the derivative by using the “finite difference” method as Prof Ng showed in the lectures. Now the question is whether your implementation of all the derivatives in the back propagation steps is correct or not. We are comparing the “finite difference” approximation of the derivative to the actual gradients computed by your “back prop” logic. They should be close. If they are not, then it indicates there is a bug in your back prop logic.
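In numpy terms, the idea is something like this minimal sketch (the names gradient_check, cost_fn, and analytic_grad are placeholders for illustration, not the notebook's actual API):

```python
import numpy as np

def gradient_check(theta, cost_fn, analytic_grad, epsilon=1e-7):
    """Compare a finite-difference gradient estimate with the backprop gradient.

    theta:         all parameters flattened into one vector
    cost_fn:       computes the cost J(theta)
    analytic_grad: the gradient vector produced by your back prop code
    """
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        # Perturb one parameter at a time by +/- epsilon (two-sided difference)
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)

    # Relative difference between the two estimates: near zero means they agree
    return np.linalg.norm(approx - analytic_grad) / (
        np.linalg.norm(approx) + np.linalg.norm(analytic_grad))
```

If the returned value is much bigger than your threshold (e.g. 10^{-7}), that is the signal that the back prop logic has a bug.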
But exactly as Tom says, this method is not something that we really ever use. It’s more just background to understand how things have developed over time. Once we graduate to using TensorFlow (or any other ML platform like PyTorch or a dozen others), we no longer even have to worry about writing the back prop logic: it is handled for us by the platform. In fact, in some cases TF or PyTorch is actually doing something that is similar in principle to the finite difference method to compute the gradients. That’s in cases in which it’s not possible to take the derivative in “closed form”.
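For example, here is a tiny PyTorch illustration of the gradients being handled for us (just a sketch, not part of the course assignments):

```python
import torch

# Autograd builds the backward pass for us; no hand-written backprop needed.
w = torch.tensor([0.5, -1.2], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
print(w.grad)  # tensor([ 1.0000, -2.4000]), i.e. dloss/dw = 2w
```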
Okay, thanks, I got that.
But let me put the question in a clearer form.
Mathematically speaking, suppose we have a function f(w) = w^2 and a function w(x) = x^2.
If we make a small change \epsilon and add it to w, we get f(w + \epsilon) = (w + \epsilon)^2, and since \epsilon is small, the squaring keeps the change f(w + \epsilon) - f(w) small.
If we instead add \epsilon to x, the change is also small, but it passes through one more squaring before reaching f.
So the change with regard to w differs from the change with regard to x: in this case f((x + \epsilon)^2) - f(x^2) will be even smaller than f(w + \epsilon) - f(w), and so on as we stack more layers of functions.
So if we set a threshold of 0.0001, for example, it may be right for f(w + \epsilon) - f(w) but not for f((x + \epsilon)^2) - f(x^2).
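Concretely, here is the quick numeric check I have in mind (x = 0.3 is picked arbitrarily so that all intermediate values stay below 1):

```python
eps = 1e-4
x = 0.3
w = x ** 2  # w(x) = x^2

def f(w):
    return w ** 2  # f(w) = w^2

direct = f(w + eps) - f(w)                 # perturb w directly
through_x = f((x + eps) ** 2) - f(x ** 2)  # perturb x, one layer earlier

print(direct)     # ~1.80e-05 (roughly 2*w*eps)
print(through_x)  # ~1.08e-05 (roughly 4*x**3*eps -- smaller, as feared)
```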
Ok, but the difference is what the difference is. And note that we are dealing with layers of a neural network here, so they typically don’t look like z^2 or z^4, right? They are linear layers with a non-linear activation function like ReLU or sigmoid or tanh. But the w and b values can have small absolute values.
And you get to choose the value of \epsilon that you use and also the threshold for what counts as a good enough approximation.
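For reference, the comparison in the assignment is a relative difference, so the overall scale of the gradients is factored out:

$$\text{difference} = \frac{\lVert \text{grad}_{\text{approx}} - \text{grad} \rVert_2}{\lVert \text{grad}_{\text{approx}} \rVert_2 + \lVert \text{grad} \rVert_2}$$

with values on the order of 10^{-7} typically counting as “close enough”.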
Have you worked the assignment about Gradient Checking yet? They show you mistakes in the back propagation process, but the ones they show you are pretty big mistakes. You can also experiment with making relatively smaller mistakes in the back propagation process and then see how it works with different sized values of \epsilon in order to detect the error.
So instead of speculating about what might happen, you can actually try some experiments and see what really happens. Science!
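For instance, one such experiment might look like this sketch (the toy cost J(\theta) = \sum_i \theta_i^4, the 0.1% injected error, and the \epsilon values are all made up for illustration):

```python
import numpy as np

# Toy problem: J(theta) = sum(theta**4), so the true gradient is 4*theta**3.
theta = np.array([0.5, -0.3, 0.8])
cost = lambda t: np.sum(t ** 4)
true_grad = 4 * theta ** 3

# Simulate a small back prop bug: one gradient entry is off by 0.1%.
buggy_grad = true_grad.copy()
buggy_grad[1] *= 1.001

for eps in [1e-3, 1e-5, 1e-7]:
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[i] += eps
        t_minus[i] -= eps
        approx[i] = (cost(t_plus) - cost(t_minus)) / (2 * eps)
    diff = np.linalg.norm(approx - buggy_grad) / (
        np.linalg.norm(approx) + np.linalg.norm(buggy_grad))
    print(f"eps={eps:.0e}  relative difference={diff:.2e}")
```

Even that 0.1% mistake pushes the relative difference a couple of orders of magnitude above a 10^{-7} pass threshold, for every \epsilon in the sweep.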
I can’t thank you enough for your help
I’ve finished the assignment, but my OCD made me think like that, haha. I get it now: the change is very small, and the activation, though not linear, won’t amplify \epsilon with each layer.
I will move forward and try to be more focused.