Below are the points we don't understand from the lecture "Gradient Checking Implementation Notes". Could you please help us understand? We cannot see why we should run grad check at random initialization, and then run grad check again after training the network for a while. Why is that, sir? @bahadir @nramon @eruzanski @javier @marcalph @elece
It is not impossible, rarely happens, but it's not impossible that your implementation of gradient descent is correct when w and b are close to 0, so at random initialization. But that as you run gradient descent and w and b become bigger, maybe your implementation of backprop is correct only when w and b is close to 0, but it gets more inaccurate when w and b become large. So one thing you could do, I don't do this very often, but one thing you could do is run grad check at random initialization and then train the network for a while so that w and b have some time to wander away from 0, from your small random initial values. And then run grad check again after you've trained for some number of iterations.
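To make the quoted suggestion concrete, here is a minimal gradient-check sketch (not from the lecture; the one-example logistic-regression model, the specific values, and the eps are all illustrative). It compares backprop's analytic gradients against centered-difference numerical gradients, once with w and b near 0 (as at random initialization) and once with larger values (as they might be after some training). For a correct backprop, both relative differences come out tiny; a backprop bug that only matters for large w and b would show up in the second check but not the first.

```python
import numpy as np

def forward(w, b, x, y):
    # one-example logistic-regression loss (illustrative model)
    a = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def backprop(w, b, x, y):
    # analytic gradients: dL/dz = a - y, so dw = dz * x and db = dz
    a = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    dz = a - y
    return dz * x, dz

def grad_check(w, b, x, y, eps=1e-7):
    # numerical gradients via centered differences, one parameter at a time
    dw, db = backprop(w, b, x, y)
    num_dw = np.zeros_like(w)
    for i in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        num_dw[i] = (forward(wp, b, x, y) - forward(wm, b, x, y)) / (2 * eps)
    num_db = (forward(w, b + eps, x, y) - forward(w, b - eps, x, y)) / (2 * eps)
    ana = np.concatenate([dw, [db]])
    num = np.concatenate([num_dw, [num_db]])
    # relative difference between analytic and numerical gradients
    return np.linalg.norm(ana - num) / (np.linalg.norm(ana) + np.linalg.norm(num))

x = np.array([0.5, -0.2, 0.1])
y = 1.0

# 1) grad check near random initialization (w and b close to 0)
w_small = np.array([0.01, -0.02, 0.005])
print(grad_check(w_small, 0.0, x, y))

# 2) grad check again with larger w and b, as after some training
w_large = np.array([2.0, -1.5, 0.8])
print(grad_check(w_large, 1.0, x, y))
```

Both printed ratios should be very small (roughly 1e-7 or less) when the analytic gradients are correct at that point in parameter space.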
I'm still unable to understand the whole bold highlighted paragraph. Could you please help me understand it?
Especially this part: "It is not impossible, rarely happens, but it's not impossible that your implementation of gradient descent is correct when w and b are close to 0, so at random initialization."
Also, how do w and b get larger as training progresses? Shouldn't w and b be getting smaller, since w = w - alpha * dw?
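On the w = w - alpha * dw point: the update shrinks the loss, not the weights themselves, so if the loss minimum sits at a large w, gradient descent moves w away from 0. A tiny demo (the loss f(w) = (w - 5)**2 is made up purely for illustration):

```python
# gradient descent on f(w) = (w - 5)**2, whose minimum is at w = 5
w = 0.01              # small initialization near 0, as with random init
alpha = 0.1
for _ in range(100):
    dw = 2 * (w - 5)      # dw is the gradient of the loss, and is negative here
    w = w - alpha * dw    # subtracting a negative gradient *increases* w
print(w)                  # w has grown from 0.01 toward 5
```

The sign of dw depends on where w sits relative to the loss minimum, so the update can grow or shrink w; it only shrinks w systematically when something like weight decay pulls it toward 0.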
Dear Mentor, could you please help me understand this?
It is not impossible, rarely happens, but it's not impossible that your implementation of gradient descent is correct when w and b are close to 0, so at random initialization. But that as you run gradient descent and w and b become bigger, maybe your implementation of backprop is correct only when w and b is close to 0, but it gets more inaccurate when w and b become large. So one thing you could do, I don't do this very often, but one thing you could do is run grad check at random initialization and then train the network for a while so that w and b have some time to wander away from 0, from your small random initial values. And then run grad check again after you've trained for some number of iterations.
@nramon I'm unclear about your reply, sir. You said a buggy implementation of backprop could still work at random initialization. In that case, how is running grad check at random initialization helpful? If the backprop implementation looks correct whenever w and b are close to zero, what would grad check catch at random initialization?
In that case, running grad check at random initialization would not be helpful. Instead, you could train the network for a while, so that w and b have some time to wander away from 0, and then run grad check.
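That workflow can be sketched as follows (a scalar logistic unit with made-up data and hyperparameters, purely illustrative): train for some iterations so the parameters leave their small initial values, then compare backprop's gradient to a centered difference at the trained point.

```python
import math

def loss(w, b, x, y):
    # scalar logistic-unit loss (illustrative)
    a = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def grads(w, b, x, y):
    # analytic gradients from backprop: dz = a - y
    a = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return (a - y) * x, (a - y)

x, y = 1.5, 1.0
w, b = 0.01, 0.0   # small initial values near 0

# train for a while so w and b wander away from 0
for _ in range(500):
    dw, db = grads(w, b, x, y)
    w -= 0.1 * dw
    b -= 0.1 * db

# then run grad check at the trained parameters
eps = 1e-6
num_dw = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
dw, _ = grads(w, b, x, y)
print(abs(num_dw - dw))  # small only if backprop is correct at the trained w, b
```

Here the check runs at parameters that have grown well away from 0, which is exactly where a bug that only appears at large w and b would surface.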