Is this another error by Andrew?

In the Week 2 → Back Propagation (optional) → What is a derivative (Optional) video of this course, Andrew computes the derivative of J(w) = w^2 as 2w, which is mathematically correct.

For w = 3, the derivative is correctly given as 6.

Andrew notes that the actual value of J(w) increases from 9 to 9.006001 when w is increased from w = 3 to w = 3.001. That is an increase of 0.006001 in J(w), which is approximately six times the 0.001 increase in w. This sixfold increase can be seen by working out the derivative from first principles:

\;\;\;\;\frac{\partial J(w)}{\partial w} = \lim\limits_{\epsilon \to 0}\;\frac{(w + \epsilon)^2 - w^2}{\epsilon}

\;\;\;\;\;\;\;\;\;\;\;\; = \lim\limits_{\epsilon \to 0}\;\frac{w^2 + 2w\epsilon + \epsilon^2 - w^2}{\epsilon}

\;\;\;\;\;\;\;\;\;\;\;\; = \lim\limits_{\epsilon \to 0}\;\frac{2w\epsilon + \epsilon^2}{\epsilon}

\;\;\;\;\;\;\;\;\;\;\;\; = \lim\limits_{\epsilon \to 0}\;(2w + \epsilon)

\;\;\;\;\;\;\;\;\;\;\;\; = 2w

So for w = 3, this evaluates to 6.
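As a quick numerical check of the limit above, here is a minimal Python sketch that evaluates the difference quotient at w = 3 for a few shrinking values of \epsilon. The helper name `difference_quotient` and the particular \epsilon values are my own choices for illustration, not something from the video.

```python
def J(w):
    """Cost function from the video example: J(w) = w^2."""
    return w ** 2


def difference_quotient(f, w, eps):
    """Average rate of change of f between w and w + eps."""
    return (f(w + eps) - f(w)) / eps


w = 3.0
# Hypothetical step sizes, chosen only to show the trend as eps shrinks.
quotients = {eps: difference_quotient(J, w, eps)
             for eps in (0.1, 0.01, 0.002, 0.001, 1e-6)}

for eps, q in quotients.items():
    print(f"eps = {eps:<8g} quotient = {q:.6f}")
```

Up to floating-point rounding, each quotient should come out as 6 + \epsilon, so it approaches 6 as \epsilon shrinks without ever equalling 6 for a finite step.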

This needs correction - update to follow.

Hi @ai_is_cool

Derivatives represent the instantaneous rate of change, not an exact prediction over larger intervals. Andrew’s generalization holds exactly only in the limit \epsilon \to 0, meaning for infinitesimally small changes in w. So while

\frac{J(w + \epsilon) - J(w)}{\epsilon} \approx 6

for very small \epsilon, like 0.001, this approximation is less accurate for larger \epsilon, like 0.002. For J(w) = w^2 at w = 3 the quotient is exactly 6 + \epsilon, so the gap from the derivative grows with \epsilon. That’s why the increase no longer scales exactly by the derivative when \epsilon is not small enough: you’re leaving the “local linear” zone where the tangent line is a good approximation of the function.

Hope it helps! Feel free to ask if you need further assistance.

What do you mean by “…prediction…” when you are talking about the derivative?

What’s a “local linear zone”? It’s not mentioned in Andrew’s video presentation.

Actually, I think I now understand what Andrew is getting at.

He uses the example of J(w) = w^{2} to show that the smaller \epsilon becomes, the closer the increase in J(w), as a multiple of \epsilon, comes to the derivative of J(w) at w = 3.

But what does all this have to do with ADAM?

Hi again @ai_is_cool,

By prediction I mean the linear approximation that the derivative provides. The derivative at a point gives the instantaneous rate of change that can be interpreted as the slope of the tangent line to the function at that point. This slope predicts how much the function would increase for an infinitesimally small increase in w.

Also for the “local linear zone”, that’s just a way to say the region around w=3 where the function behaves almost like a straight line (its tangent).

ADAM uses gradients (which are derivatives) to update parameters. We know that derivatives are local linear approximations — they only perfectly describe the function’s change at infinitesimally small steps. Since ADAM takes many small steps guided by these gradients, understanding this “local linearity” is important for how these updates work effectively.
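To make that connection concrete, here is a minimal sketch of the standard Adam update applied to the same J(w) = w^2, starting from w = 3. The learning rate of 0.1 and the 200 steps are assumptions I picked for illustration, the beta values and the small stabilising constant are the commonly used defaults, and none of this code comes from the course video. The constant is named `tiny` here so it is not confused with the \epsilon step discussed above.

```python
import math


def grad_J(w):
    """Derivative of J(w) = w^2, i.e. 2w (the local slope at w)."""
    return 2.0 * w


def adam_minimize(w, steps=200, lr=0.1, beta1=0.9, beta2=0.999, tiny=1e-8):
    """Run `steps` Adam updates on a single scalar parameter w.

    Hyperparameter values are illustrative assumptions, not course settings.
    """
    m, v = 0.0, 0.0  # running averages of the gradient and the squared gradient
    history = [w]
    for t in range(1, steps + 1):
        g = grad_J(w)                        # gradient at the current w
        m = beta1 * m + (1 - beta1) * g      # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # second-moment estimate
        m_hat = m / (1 - beta1 ** t)         # bias correction for zero initialisation
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + tiny)
        history.append(w)
    return history


history = adam_minimize(3.0)
print(f"start w = {history[0]}, after 1 step w = {history[1]:.4f}, "
      f"final w = {history[-1]:.4f}")
```

Every update only looks at the gradient at the current w, which is exactly the local slope discussed above. With these settings the first step moves w from 3 to about 2.9, because the bias-corrected ratio m_hat / sqrt(v_hat) is essentially 1 on the first update.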

Hope it helps! Feel free to ask if you need further assistance.

I still don’t know what you mean by “…linear approximation…”.

Evaluation of the derivative for a particular value of the independent variable gives the exact rate of change of the dependent variable with respect to the independent variable at that value of the independent variable.

Basic calculus.

Near a specific point (like w=3), the function J(w) = w^2 can be closely matched by a straight line — the tangent line at that point.

So think of zooming in super close on the curve at w=3. At that tiny scale, the curve looks almost like a straight line. This straight line’s slope is exactly the derivative at that point.

What has that got to do with Andrew’s generalisation?

The generalisation is based on that idea of linear approximation. When he says that increasing w by \varepsilon causes J(w) to increase by approximately k \cdot \varepsilon, he’s using the tangent line (i.e., the derivative) as a way to approximate how the function behaves very close to that point.

The derivative gives the slope of that tangent line — and for very small \varepsilon, the increase in J(w) behaves almost linearly, just like that tangent line would predict. That’s what makes the generalisation work only for small \varepsilon. Once you go further from the point, the actual function and the tangent line start to diverge.

So yeah, the whole generalisation is just a practical use of linear approximation — nothing more mystical than that.
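To put numbers on the tangent-line picture, here is a small Python sketch comparing the actual J(w) = w^2 with its tangent line at w = 3. The function names and the three step sizes are hypothetical choices for illustration only.

```python
def J(w):
    """Cost function from the video example: J(w) = w^2."""
    return w ** 2


def dJ_dw(w):
    """Exact derivative of J, i.e. 2w."""
    return 2 * w


def tangent_line(w, w0):
    """Value at w of the tangent line to J drawn at w0."""
    return J(w0) + dJ_dw(w0) * (w - w0)


w0 = 3.0
gaps = {}
for eps in (1.0, 0.1, 0.001):  # illustrative step sizes
    actual = J(w0 + eps)
    predicted = tangent_line(w0 + eps, w0)
    gaps[eps] = actual - predicted  # how far the curve is above the tangent
    print(f"eps = {eps:<6g} actual = {actual:.6f} "
          f"tangent = {predicted:.6f} gap = {gaps[eps]:.6f}")
```

For this particular function the gap between the curve and the tangent line works out to \epsilon^2, so it shrinks much faster than \epsilon itself. That is the sense in which the curve looks like a straight line close to w = 3.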

Why are you using dot product notation for vectors when k and \epsilon are scalars?

In this case it’s not dot product notation. It’s just scalar multiplication. The “·” symbol is used for both dot product and scalar multiplication, depending on context. Both k and ε are clearly scalars here.

Nowhere in my education have I seen the dot notation used to denote scalar multiplication and Prof. Ng doesn’t use it either. He uses \times, as in all mathematics; however, \times is also used to denote the vector product operation between vectors.

I would encourage you to adhere to the same notation as Prof. Ng does as it is ambiguous to use the dot notation for both scalar and vector operations interchangeably.

Actually, the dot notation is used to denote scalar multiplication, but with no space on each side of the dot. When denoting the dot product of vectors, there is usually a space on each side of the \cdot.

Thanks for your suggestion!

However, according to the ISO 31-0 standard, which outlines international conventions for mathematical notation, the multiplication of quantities can be indicated by juxtaposition (ab), a space (a b), a centered dot (a·b), or a cross (a×b). The centered dot is often preferred to avoid confusion with the letter “x” or the cross product symbol “×”, especially in typeset mathematics.

I see, that’s interesting.

Since I quoted the \times notation in my original question directly from Prof. Ng’s course, I think we should adhere to that notation in this thread. It is valid and complies with ISO 31-0, and there is no need to change to an alternative notation as it doesn’t add value to the response but only confusion.

In my opinion, ISO 31-0 should be revised so that mathematical operators are unique and not dependent on context to make operations unambiguous.

It’s certainly an interesting take to suggest that ISO 31-0, an international standard developed and agreed upon by numerous experts globally, should be revised simply because ‘someone’, ‘nowhere in their education’ encountered the centered dot for scalar multiplication.

By all means, you absolutely should enlighten these thousands of experts with your perspective. I’m sure they’ll be eager to hear that their widely accepted notation ‘doesn’t add value’ and only causes ‘confusion’ because it wasn’t part of ‘someone’s’ specific curriculum. Perhaps they’ll overhaul the standard once they realize it doesn’t perfectly align with this unique personal experience.

In fact, I wholeheartedly encourage you to reach out. You can probably find the contact information for the relevant ISO technical committee or their general inquiries email with a quick search online, they must have a channel for such groundbreaking feedback. Please, do share your proposal with them and be sure to let us all know how that revision process goes!

There’s no need for rudeness and sarcasm when someone gives an opinion on ISO 31-0.

It doesn’t help anyone and contributes nothing towards advancing understanding of mathematics and how best to express it.

Well now, this is quite the conundrum! We all recently saw your passionate arguments about the centered dot for multiplication – how ‘nowhere in my education’ had you encountered it, how it supposedly ‘doesn’t add value but only confusion,’ and your bold call for ISO to revise their standards.

But then, imagine my surprise when I came across one of your earlier posts, “Calculation of partial derivative of the cost function for logistic regression” (Machine Learning Specialization / Supervised ML: Regression and Classification, DeepLearning.AI), where, clear as day, you yourself are using that very same centered dot for multiplication in your own handwritten work!

So, one has to wonder: if you were comfortably using this notation yourself before you launched your critique, what inspired your later claims of complete unfamiliarity and your campaign against its ‘confusing’ nature? Did you perhaps forget your own prior usage when you decided the international standard needed your urgent revision? It certainly makes one reconsider the foundation of those strongly defended opinions. Truly fascinating!

Rather than speculate about me, do you have anything mathematical and solution-oriented to contribute to the suggestion I made about revising ISO 31-0?

Your post has no bearing on the topic of this thread.

I suggest you review the topic and make a useful and appropriate contribution.

I suggest you revise your attitude, tone, and the way you address fellow learners and instructors in this healthy community. I went through the whole thread and the amount of rudeness and aggression coming from you every time someone tries to offer you knowledge or help is unbearable.

Instead of throwing baseless remarks and criticizing things so common that everyone uses them nowadays, educate yourself, or let others do so. We are all here to learn, not to mock each other’s attempts at helping.

I’m not going to be drawn into this argumentative and judgmental conversation with you.

Please only post content that contributes usefully to answering the topic.