Why Grad Check at random initialization

From the lecture video https://www.coursera.org/learn/deep-neural-network/lecture/6igIc/gradient-checking-implementation-notes,

why does gradient checking need to be run at random initialization when the implementation of gradient descent is correct while w and b are close to zero? How is grad check going to be helpful in this case?

Gradient checking just verifies whether your code computes the correct gradients.


Right, you don’t need to do gradient checking on a “trained” network. As Tom says, it’s just checking the correctness of the algorithm and how you implemented the derivatives. So you can just randomly initialize all the weights to non-zero values and then run Gradient Checking. It’s not an iterative process: you just run one pass through forward and back prop with one batch of data to make sure your algorithm is correct. Once you are sure about that, then you run the real training.
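To make the "one pass with random non-zero weights" idea concrete, here is a minimal sketch of gradient checking on a made-up single-layer squared-error cost. The model, the names (`w`, `b`, `X`, `y`), and the tolerance are illustrative assumptions, not the course's actual code: the point is just comparing the backprop gradient against a two-sided numerical difference at a randomly initialized point.

```python
import numpy as np

def forward_cost(w, b, X, y):
    # Hypothetical cost: mean squared error of a linear model on one batch.
    yhat = X @ w + b
    return float(np.mean((yhat - y) ** 2))

def analytic_grads(w, b, X, y):
    # "Backprop" for the cost above, i.e. the hand-derived gradients.
    m = X.shape[0]
    yhat = X @ w + b
    dyhat = 2.0 * (yhat - y) / m
    dw = X.T @ dyhat
    db = float(np.sum(dyhat))
    return dw, db

def numeric_grad_w(w, b, X, y, eps=1e-7):
    # Two-sided finite-difference approximation, one component at a time.
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (forward_cost(w_plus, b, X, y)
                   - forward_cost(w_minus, b, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)   # random, non-zero initialization
b = 0.5

dw, _ = analytic_grads(w, b, X, y)
dw_num = numeric_grad_w(w, b, X, y)
rel_err = np.linalg.norm(dw - dw_num) / (np.linalg.norm(dw) + np.linalg.norm(dw_num))
print(rel_err)
```

If backprop is implemented correctly, the relative error printed at the end is tiny (on the order of 1e-9 or less); a large value flags a bug in the derivative code. Note it is a single forward/backward comparison, not part of the training loop.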

Hi @paulinpaloalto @TMosh, actually my doubt is with respect to the statement below, sir.

Prof. Andrew Ng said:

Finally, this is a subtlety. It is not impossible, rarely happens, but it’s not impossible that your implementation of gradient descent is correct when w and b are close to 0, so at random initialization. But that as you run gradient descent and w and b become bigger, maybe your implementation of backprop is correct only when w and b is close to 0, but it gets more inaccurate when w and b become large.

I'm not getting the highlighted bold portion, sir … and why do we need to run grad check if the implementation of gradient descent is correct at random initialization?

Gradient checking is only done when you have written a totally new cost function, and you need to verify whether it gives gradient and cost values that are consistent.

That is the only time you run gradient checking.

I do not understand Andrew’s statement, other than to clarify that checking when w and b are zero (or very close to zero) is not a very robust test, and you should not rely on that alone.

I agree with Tom that I don’t understand what Prof Ng is saying here. Maybe it’s a lack of imagination on my part, but I’m having a hard time imagining an algorithm that would work very well for small values but not be correct for larger values.

The other point is that just because you use random initialization doesn’t mean that the values are small, right? You can start with a Gaussian distribution with \mu = 0 and \sigma = 1 and multiply by whatever you want. If you’re worried about the values potentially being small, multiply by 42. Or just use the raw \sigma = 1 distribution: 1 is not close to zero in the sense he means. Of course you wouldn’t do that for normal random initialization (it would mess over your convergence), but if the purpose is just to run Gradient Checking, then that would solve the concern that I guess he’s expressing here.
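A tiny sketch of the scaling idea above, with illustrative shapes and seed (the specific numbers are assumptions, not anything from the course):

```python
import numpy as np

rng = np.random.default_rng(1)

# Raw Gaussian with mu = 0, sigma = 1: entries are O(1), i.e. not "close to
# zero" in the sense Prof Ng means.
W_check = rng.normal(loc=0.0, scale=1.0, size=(4, 5))

# If you worry that some draws might still be small, scale them up. This
# would be terrible for real training (it would hurt convergence), but for
# a one-pass gradient check it simply exercises the code at large w values.
W_big = 42.0 * W_check
```

For actual training you would instead scale the weights down (e.g. He or Xavier initialization); the large-scale version exists only to test backprop away from zero.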