So once we know that some gradients calculated during back propagation differ from the ones calculated by estimation, what exactly do we do? My guess is that we could take the differences between individual entries of the long grad vector (the reshaped back propagation gradients) and the gradapprox vector, and then find which ones are greater than some threshold, say 1e-5. After that we could replace the problematic ones with values from gradapprox and then run gradient descent for better performance. Any input is welcome!
No, it’s not really about analyzing individual elements of the gradapprox vector. The point is that a “check” value above the threshold means there are bugs in your back propagation logic, so the next step is to carefully examine that logic to find them. The bugs were put there on purpose, and they should be pretty easy to spot in this example case, even if that might not be so true in “real life”.
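To make that concrete, here is a minimal sketch of a gradient check catching a planted bug. It is not the assignment code: the toy cost function and all the names (cost, backward_correct, backward_buggy, grad_approx, check_value) are hypothetical, chosen only for illustration, and it uses plain Python instead of NumPy.

```python
import math

def cost(theta):
    # Hypothetical toy cost: J(theta) = sum_i (i + 1) * theta_i^2
    return sum((i + 1) * t * t for i, t in enumerate(theta))

def backward_correct(theta):
    # Analytic gradient of the toy cost: dJ/dtheta_i = 2 * (i + 1) * theta_i
    return [2 * (i + 1) * t for i, t in enumerate(theta)]

def backward_buggy(theta):
    # Same gradient, with a bug planted on purpose in the last component
    grad = backward_correct(theta)
    grad[-1] *= 2
    return grad

def grad_approx(cost_fn, theta, eps=1e-7):
    # Two-sided finite difference, one parameter at a time
    approx = []
    for i in range(len(theta)):
        plus = list(theta)
        plus[i] += eps
        minus = list(theta)
        minus[i] -= eps
        approx.append((cost_fn(plus) - cost_fn(minus)) / (2 * eps))
    return approx

def norm(v):
    # Euclidean (L2) norm of a list of floats
    return math.sqrt(sum(x * x for x in v))

def check_value(grad, approx):
    # Normalized Euclidean distance between the two gradient vectors
    diff = [g - a for g, a in zip(grad, approx)]
    return norm(diff) / (norm(grad) + norm(approx))

theta = [0.5, -1.2, 2.0]
approx = grad_approx(cost, theta)

good = check_value(backward_correct(theta), approx)  # tiny: code agrees with the estimate
bad = check_value(backward_buggy(theta), approx)     # large: something is wrong in backward

# The per-component gaps are only a hint about where to start reading the code
buggy = backward_buggy(theta)
suspect = max(range(len(theta)), key=lambda i: abs(buggy[i] - approx[i]))

print(good, bad, suspect)
```

The largest per-component gap only suggests which part of the back propagation code to read first. The fix is still to correct the code, not to substitute the estimated values.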
Hi Shubham,
Thank you for asking this question. It is really thought-provoking and helped me gain a clearer understanding of gradient checking (grad check) in practice.
Yes, as you mentioned, we compare the back propagation gradient vector (the long dθ vector) with gradapprox, and we need a threshold to decide whether the difference is significant. One thing Andrew mentioned in the course video (DLS C2-W1: Setting Up your Optimization Problem: Gradient Checking) is to compute the normalized Euclidean distance and check that value instead of looking directly at the absolute distance. He also described three levels for that value: around 1e-7 is great; around 1e-5 calls for a careful look and a double check; around 1e-3 is worrying, and you should look at the individual components to see where the mismatch is.
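For reference, the normalized distance can be written as follows. This is my own transcription, assuming dθ is the gradient vector from back propagation, dθ_approx is the two-sided numerical estimate, J is the cost, and ε is a small step (I believe the lecture uses something like 1e-7).

```latex
d\theta_{\text{approx}}[i] =
  \frac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots) - J(\theta_1, \dots, \theta_i - \varepsilon, \dots)}{2\varepsilon}

\text{difference} =
  \frac{\lVert d\theta_{\text{approx}} - d\theta \rVert_2}
       {\lVert d\theta_{\text{approx}} \rVert_2 + \lVert d\theta \rVert_2}
```

The denominator makes the value a ratio, so the same thresholds stay meaningful whether the gradients themselves are large or small.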
I also noticed you mentioned replacing the problematic ones with values from gradapprox. In my opinion, however, we should only use grad check to debug, not in training. Here are more implementation details in the course video (link).
Happy to discuss and learn with you. Please correct me if I got anything wrong, and let me know if you have any questions!
Best,
Kezhen
Appreciate the responses!
I think I misunderstood the main motive behind gradcheck.
So, if I am understanding this correctly, gradcheck provides a method to check whether the logic (math) has been implemented correctly, right?
Also, how would we correct vanishing or exploding gradients (values going to 0 or to +/- inf)? Do we then try to improve the initialisation to get better-scaled random numbers so as to keep the gradients reasonably finite?
Yes, gradient checking is just a way to confirm that your back propagation logic is correctly programmed. What happens when you train is then a completely separate set of issues, but it’s hopeless until you make sure at least your code is correct. The other issues you mention above (vanishing or exploding gradients) are addressed by Prof Ng as we continue through Course 2. Please stay tuned and listen to what he says.