Is this another error by Andrew?

In this course, in the Week 2 → Back Propagation (optional) → What is a derivative (Optional) video, Andrew computes the derivative of J(w) = w^2 as 2w, which is mathematically correct.

For w = 3, the derivative is correctly given as 6.

Andrew notes that the actual VALUE of J(w) increases from 9 to 9.006001 when w is increased from w = 3 to w = 3.001. That is an increase in the value of J(w) above 9 equal to six times the 0.001 increase in w. This sixfold increase is unrelated to the value of the derivative at w = 3.

If we instead increase w by, say, 0.002, we get an increase in J(w) from 9 to 9.018008999999997, which is an increase in J(w) above 9 of approximately 9 times 0.002, not 6 times 0.002.

So Andrew’s generalisation at 6 mins 20 secs:

\text{if } w \uparrow \epsilon \text{ causes } J(w) \uparrow k \times \epsilon \text{ then}

\left.\frac{\partial}{\partial w}J(w)\right|_{w = 3} = k

does not work for values of \epsilon other than 0.001.

Hi @ai_is_cool

Derivatives represent the instantaneous rate of change, not an exact prediction over larger intervals. Andrew’s generalization works only when \epsilon \to 0, meaning for infinitesimally small changes in w. So while

\frac{J(w + \epsilon) - J(w)}{\epsilon} \approx 6

for very small \epsilon, like 0.001, this approximation is less accurate for larger \epsilon such as 0.002. That’s why the increase no longer scales exactly with the derivative when \epsilon is not small enough: you’re leaving the “local linear” zone where the tangent-line approximation holds.
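To see exactly how the accuracy falls off, you can expand the difference for this particular J (this is just algebra, not something shown in the video):

J(3 + \epsilon) - J(3) = (3 + \epsilon)^2 - 3^2 = 6\epsilon + \epsilon^2

\frac{J(3 + \epsilon) - J(3)}{\epsilon} = 6 + \epsilon

So the ratio is 6.001 for \epsilon = 0.001 and 6.002 for \epsilon = 0.002; the leftover \epsilon is the part the derivative does not capture, and it vanishes as \epsilon \to 0.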

Hope it helps! Feel free to ask if you need further assistance.


What do you mean by “…prediction…” when you are talking about the derivative?

What’s a “local linear zone”? It’s not mentioned in Andrew’s video presentation.

Actually, I think I now understand what Andrew is getting at.

He uses an example of J(w) = w^{2} to show that the smaller \epsilon becomes, the closer the increase in J(w), as a multiple of \epsilon, comes to the derivative of J(w) at w = 3.

But what does all this have to do with ADAM?

Hi again @ai_is_cool,

By prediction I mean the linear approximation that the derivative provides. The derivative at a point gives the instantaneous rate of change that can be interpreted as the slope of the tangent line to the function at that point. This slope predicts how much the function would increase for an infinitesimally small increase in w.

As for the “local linear zone”, that’s just a way of saying the region around w = 3 where the function behaves almost like a straight line (its tangent).
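Written out, that tangent line at w = 3 is

L(w) = J(3) + \left.\frac{\partial}{\partial w}J(w)\right|_{w = 3}(w - 3) = 9 + 6(w - 3)

and the gap between the curve and the line is J(w) - L(w) = (w - 3)^2, which is tiny whenever w stays close to 3.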

ADAM uses gradients (which are derivatives) to update parameters. Derivatives are local linear approximations; they only describe the function’s change exactly for infinitesimally small steps. Since ADAM takes many small steps guided by these gradients, understanding this “local linearity” helps explain why those updates work.
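To make that concrete, here is a minimal sketch of the standard Adam update rule applied to this same J(w) = w^2. The learning rate and other hyperparameters below are just the usual defaults, not anything taken from the course:

```python
import math

def grad(w):
    return 2 * w  # dJ/dw for J(w) = w^2

w = 3.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = v = 0.0

for t in range(1, 201):
    g = grad(w)                          # local slope at the current w
    m = beta1 * m + (1 - beta1) * g      # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g**2   # second-moment estimate
    m_hat = m / (1 - beta1**t)           # bias corrections
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(f"w after 200 steps: {w:.4f}")     # heads toward the minimum at w = 0
```

Notice that every step only ever uses the slope 2w at the current point, i.e. purely local information, which is why the local nature of the derivative is all a gradient-based optimizer needs.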

Hope it helps! Feel free to ask if you need further assistance.

I still don’t know what you mean by “…linear approximation…”.

Evaluation of the derivative for a particular value of the independent variable gives the exact rate of change of the dependent variable with respect to the independent variable at that value of the independent variable.

Basic calculus.

Near a specific point (like w=3), the function J(w) = w^2 can be closely matched by a straight line — the tangent line at that point.

So think of zooming in super close on the curve at w=3. At that tiny scale, the curve looks almost like a straight line. This straight line’s slope is exactly the derivative at that point.
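If it helps, here is a minimal Python sketch (my own illustration, nothing course-specific) of that zooming in: it compares the true value of J(3 + \epsilon) with the value on the tangent line for smaller and smaller \epsilon:

```python
def J(w):
    return w ** 2

w0, slope = 3.0, 6.0  # slope of the tangent to w^2 at w = 3

for eps in [0.1, 0.01, 0.001, 0.0001]:
    actual = J(w0 + eps)                 # true value on the curve
    tangent = J(w0) + slope * eps        # value on the tangent line
    print(f"eps={eps:<8} actual={actual:.8f} "
          f"tangent={tangent:.8f} gap={actual - tangent:.8f}")
```

The gap shrinks like \epsilon^2, much faster than \epsilon itself, which is the precise sense in which the curve and its tangent become indistinguishable up close.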

What has that got to do with Andrew’s generalisation?