Gradient Checking doubts

Question 1.
If gradient checking with a two-sided limit is so accurate (and we know it can't go wrong the way our backprop equations can), why don't we always use it to compute the derivatives? I ask because, visually, the limit formula looks faster to me than these backprop formulas below:
dW[l] = (1/m) * np.dot(dZ[l], A[l-1].T)
db[l] = (1/m) * np.sum(dZ[l], axis=1, keepdims=True)
Visually, I can't see how the backpropagation equations are faster than using the limit formula. And the difference in accuracy is so small (on the order of 10^-7) that it seems insignificant.
I would appreciate it if someone could show me how the time complexity of gradient checking is worse than that of normal backpropagation.

Question 2.
I don't understand how the weights and biases get reshaped into theta. And if J(theta) is what actually gets calculated, how does that relate to the db and dW we obtain via backpropagation? I understand that db and dW are reshaped into dtheta too, but what exactly does that mean? Is theta[l] = [W[l], b[l]]?

Hi, @Jaskeerat.

You’ll actually get to implement gradient checking at the end of week 1!

Intuitively, with backprop you do one forward pass and one backward pass, and that single backward pass gives you every derivative at once. With gradient checking, you do two forward passes for every trainable parameter in your network: you nudge the parameter in question up and then down while keeping everything else the same. If one forward pass costs roughly C and the network has n parameters, the check costs about 2nC, whereas backprop costs a small constant multiple of C regardless of n. This is very inefficient.
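
To see where that 2nC comes from, here is a minimal sketch of a two-sided gradient check (the names `grad_check_sketch` and `cost_fn` are just placeholders, not the assignment's code; `cost_fn(theta)` is assumed to run a full forward pass and return the scalar cost J):

```python
import numpy as np

def grad_check_sketch(cost_fn, theta, epsilon=1e-7):
    # theta: flat vector of every weight and bias in the network
    n = theta.shape[0]
    grad_approx = np.zeros(n)
    for i in range(n):                       # one iteration per trainable parameter
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += epsilon             # nudge only parameter i up...
        theta_minus[i] -= epsilon            # ...and down, leaving the rest unchanged
        # two full forward passes just to approximate this one derivative
        grad_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    return grad_approx
```

Notice the loop: the whole network is evaluated 2n times, once per nudge, so the cost grows with the number of parameters. The check is only more accurate by roughly 10^-7, but it is slower by a factor proportional to n, which is why it's used to verify backprop rather than replace it.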

Theta is just one long vector containing every individual weight and bias term, stacked together. J(theta) is the cost as a function of theta, and what you're computing are the derivatives of J with respect to every w and b, which correspond to the elements of the different dW[l] and db[l] that you're familiar with.
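
To make the reshaping concrete, here is a minimal sketch with hypothetical shapes (a 3-4-1 network; the function name and shapes are made up for illustration, the assignment provides its own helpers):

```python
import numpy as np

parameters = {
    "W1": np.random.randn(4, 3), "b1": np.zeros((4, 1)),
    "W2": np.random.randn(1, 4), "b2": np.zeros((1, 1)),
}

def params_to_vector(params):
    # Any fixed ordering works, as long as you always use the same one
    keys = sorted(params.keys())             # here: W1, W2, b1, b2
    theta = np.concatenate([params[k].reshape(-1, 1) for k in keys], axis=0)
    return theta, keys

theta, keys = params_to_vector(parameters)
print(theta.shape)                           # (21, 1): 4*3 + 4 + 1*4 + 1 entries
```

Applying exactly the same reshaping and ordering to the dW[l] and db[l] from backprop gives dtheta, so element i of dtheta is the derivative of J with respect to element i of theta. So theta isn't theta[l] = [W[l], b[l]] layer by layer; it's a single vector with all layers' W and b entries concatenated, and dtheta lines up with it element by element, which is what lets you compare it against the two-sided approximation.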

I think you’ll enjoy the gradient checking assignment :slight_smile:
