Gradient Checking Implementation Notes

Hi Mentor,

Below are the points we don't understand from the lecture "Gradient Checking Implementation Notes". Could you please help us understand? We cannot see why we should run grad check at random initialization and then, after training the network for a while, run grad check again. Why is that, sir?
@bahadir
@nramon
@eruzanski
@javier
@marcalph
@elece

It is not impossible, rarely happens, but it's not impossible that your implementation of gradient descent is correct when w and b are close to 0, so at random initialization. But that as you run gradient descent and w and b become bigger, maybe your implementation of backprop is correct only when w and b are close to 0, but it gets more inaccurate when w and b become large. So one thing you could do, I don't do this very often, but one thing you could do is run grad check at random initialization and then train the network for a while so that w and b have some time to wander away from 0, from your small random initial values. And then run grad check again after you've trained for some number of iterations.

Hi, @Anbu.

The reason is stated in the paragraph you posted :slight_smile:

A buggy implementation of backpropagation could work at initialization and become inaccurate as training progresses.
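
In case it helps, here is a minimal sketch of what grad check actually computes (my own illustration, not code from the course, and the function and parameter names are made up): it compares the gradients your backprop produced against a two-sided numerical estimate, evaluated at the current values of w and b. Because it uses the current parameter values, the same comparison can be run at any point in training, not only right after initialization.

```python
import numpy as np

def grad_check(cost_fn, theta, backprop_grads, eps=1e-7):
    """Compare backprop gradients against a two-sided numerical estimate.

    cost_fn        : function mapping the flat parameter vector theta to the cost J
    theta          : flat vector holding the current values of all w and b
    backprop_grads : flat vector of dW/db produced by your backprop
    """
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        # two-sided difference: (J(theta + eps) - J(theta - eps)) / (2 * eps)
        numeric[i] = (cost_fn(plus) - cost_fn(minus)) / (2 * eps)

    # relative difference between the backprop and numerical gradients
    return (np.linalg.norm(backprop_grads - numeric)
            / (np.linalg.norm(backprop_grads) + np.linalg.norm(numeric)))
```

A relative difference around 1e-7 usually means the two agree, while something like 1e-3 or larger suggests a bug. The lecture's point is that this verdict only applies to the particular w and b you checked at, which is why a check that passes at initialization may not be enough.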

Let me know what part is not clear.

Hi Sir,

I'm still unable to understand the whole bold highlighted paragraph. Could you please help me understand it?

Especially this part: "It is not impossible, rarely happens, but it's not impossible that your implementation of gradient descent is correct when w and b are close to 0, so at random initialization."

Also, how do w and b get larger as training progresses? Shouldn't w and b keep getting smaller, since w = w - alpha * dw?

@neurogeek
@marcalph
@lucapug
@javier
@matteogales

Dear Mentor, could you please help me understand this?

It is not impossible, rarely happens, but it's not impossible that your implementation of gradient descent is correct when w and b are close to 0, so at random initialization. But that as you run gradient descent and w and b become bigger, maybe your implementation of backprop is correct only when w and b are close to 0, but it gets more inaccurate when w and b become large. So one thing you could do, I don't do this very often, but one thing you could do is run grad check at random initialization and then train the network for a while so that w and b have some time to wander away from 0, from your small random initial values. And then run grad check again after you've trained for some number of iterations.

If the derivatives are positive, yes. What happens when they are negative?

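To make that concrete, here is a tiny numerical example (the values are made up purely for illustration) of the update w = w - alpha * dw:

```python
alpha = 0.1
w = 0.05           # a small value from random initialization

dw = 0.3           # positive derivative: the update moves w down
print(w - alpha * dw)    # 0.02

dw = -0.3          # negative derivative: the same update moves w up
print(w - alpha * dw)    # 0.08
```

So the update does not always shrink w; it moves w in whichever direction lowers the cost, and over many iterations w and b can drift far away from their small initial values.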

Here’s the link to the gradient descent lecture, in case you find it helpful :slight_smile:

@nramon I'm unclear about your reply, sir. You said a buggy implementation of backprop could still work at random initialization. If the implementation is going to look correct while w and b are close to zero, how is running grad check at random initialization helpful at all?

Hi, @Anbu.

In that case, running grad check after random initialization would not be helpful. Instead, you could train the network for a while so that w and b have some time to wander away from 0 and then run grad check :slight_smile:
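
If it helps to see that end to end, here is a rough self-contained sketch of the workflow (a toy logistic regression I put together for illustration; it is not code from the course):

```python
import numpy as np

# Toy logistic regression with one weight and one bias, to show the
# "check at init, train for a while, check again" ordering.
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = (X > 0).astype(float)

def cost(w, b):
    a = 1 / (1 + np.exp(-(w * X + b)))
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def backprop(w, b):
    a = 1 / (1 + np.exp(-(w * X + b)))
    return np.mean((a - y) * X), np.mean(a - y)

def grad_check(w, b, eps=1e-7):
    dw, db = backprop(w, b)
    dw_num = (cost(w + eps, b) - cost(w - eps, b)) / (2 * eps)
    db_num = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)
    analytic, numeric = np.array([dw, db]), np.array([dw_num, db_num])
    return (np.linalg.norm(analytic - numeric)
            / (np.linalg.norm(analytic) + np.linalg.norm(numeric)))

# 1. grad check right after (small) random initialization
w, b = 0.01 * rng.normal(), 0.0
print("w at init:", w, "relative difference:", grad_check(w, b))

# 2. train for a while so w and b have time to wander away from 0
alpha = 0.5
for _ in range(2000):
    dw, db = backprop(w, b)
    w, b = w - alpha * dw, b - alpha * db

# 3. grad check again now that w has grown much larger
print("w after training:", w, "relative difference:", grad_check(w, b))
```

In this toy case both checks should pass, but if backprop had a bug that only shows up once w and b are large, the second check is the one that would catch it.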