Hello,

I have a question about the numerical approximation and the examples Prof Andrew gave. I don’t understand why we take J(w+\epsilon) - J(w-\epsilon) rather than J(w+\epsilon) - J(w), because the derivative is normally defined mathematically as the limit of \frac{J(w+\epsilon) - J(w)}{\epsilon} as \epsilon \rightarrow 0. I think that makes more sense than \frac{J(w+\epsilon) - J(w-\epsilon)}{2\epsilon}, doesn’t it?

It’s been a while since I watched these lectures, but I’m pretty sure Andrew comments on that in the lectures. It’s the difference between a “one-sided” difference and a “two-sided” (central) difference. When we are doing “real” calculus using \mathbb{R} and taking limits as \Delta x \rightarrow 0, we don’t really have to worry, because we have infinite resolution. But when we’re operating in the limited world of 64-bit floating point, with at most 2^{64} representable numbers between -\infty and +\infty, we do have to worry. It turns out that when you estimate derivatives with finite differences, the two-sided difference gives you better convergence behavior: a Taylor expansion shows the one-sided difference has error O(\epsilon), while in the two-sided difference the even-order terms cancel, leaving error O(\epsilon^2). This matters in practice, because numerical approximation of gradients is the standard way to do “gradient checking” — verifying that an analytic backprop implementation (e.g., in a TF or PyTorch model) is correct — so this behavior has been studied carefully.
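As a minimal sketch of this convergence difference (the test function J(w) = w^3 and the specific values of w and \epsilon are just illustrative choices, not from the lectures):

```python
def J(w):
    # Simple test function with a known derivative: J'(w) = 3 * w**2
    return w ** 3

def one_sided(f, w, eps):
    # Forward difference: error shrinks like O(eps)
    return (f(w + eps) - f(w)) / eps

def two_sided(f, w, eps):
    # Central difference: error shrinks like O(eps**2)
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 2.0
true_grad = 3 * w ** 2  # exact derivative: 12.0
for eps in (1e-2, 1e-4, 1e-6):
    err1 = abs(one_sided(J, w, eps) - true_grad)
    err2 = abs(two_sided(J, w, eps) - true_grad)
    print(f"eps={eps:.0e}  one-sided err={err1:.2e}  two-sided err={err2:.2e}")
```

At eps = 1e-2 the one-sided error is around 6e-2 while the two-sided error is around 1e-4, and shrinking eps by 100x shrinks the two-sided error roughly 10,000x, matching the O(\epsilon) vs O(\epsilon^2) analysis.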

Here’s a thread with a relevant question from a different course, which shows some experiments that demonstrate the behavior. You can probably also find articles on Wolfram MathWorld or MathWorks by googling “two-sided difference”.

Got it, thank you!

Yes, Prof Ng does discuss this point starting at about 1:30 in the first lecture on this in DLS C2 W1.