I would state it a bit differently: we’re trying to find the best algorithm for doing “finite difference” approximations to gradients, so we can use that to confirm that our gradient logic is correct. If you have a choice between a model that gives errors with O(\epsilon) and a model that gives errors of O(\epsilon^2), then it’s clear which one is superior, right? Remember that the whole point is that \epsilon is very small. What happens if you square 10^{-7}? Wait for it … it gets a *lot* smaller.

Now I realize that doesn’t really answer the fundamental question originally asked here, which is why the two different ways to compute finite difference derivative estimates have those error behaviors. That requires some math, but if you already know about Taylor Expansion you should be set for that. Here’s a document from Brown University which uses Taylor Series to show the behavior of one-sided and two-sided finite differences.