In video: terrible abuse of notation with "dw" denoting the gradient ... there is no need for that!

In “Gradient Descent”, our excellent host Andrew commits the following notational abuse:

The gradient “dJ(w)/dw” is denoted “dw”.

Please no! It should be called “∇J”, which is the proper notation for that mathematical object.

Denoting it “dw” is so confusing. For starters, it’s not an infinitesimal, it is the gradient. And it’s not a change in “w”, it is the (local) rate of change of “J” (which we use to update our current “w” in iterative fashion).
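To be concrete, here is what the shorthand actually stands for, written out (my own rendering, not the slide’s):

```latex
% The shorthand "dw" denotes the partial derivative of the cost J,
% not an infinitesimal change in w:
\[
  dw \;:=\; \frac{\partial J(w,b)}{\partial w},
  \qquad
  db \;:=\; \frac{\partial J(w,b)}{\partial b}
\]
% so that the gradient descent update reads
\[
  w \;:=\; w - \alpha\, dw,
  \qquad
  b \;:=\; b - \alpha\, db
  \quad\Longleftrightarrow\quad
  (w, b) \;:=\; (w, b) - \alpha\, \nabla J(w, b)
\]
```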

It is then said “if you are not familiar with infinitesimals…”. Pity the poor person who has just been thrown the above curveball!

Can this be corrected? I’m actually a bit dismayed. :scream:

Update later in the day

I have progressed a bit and arrived at the page with the example code, where the typeface (which is of course lost when writing by hand) indicates clearly enough that the dvars are program variable names and in no way mathematical notation (although the first line is unfortunately not named consistently with the rest).
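For reference, the dvars in the notebook code look roughly like this (a sketch of my own, with made-up function names, not the exact course code):

```python
import numpy as np

# Rough sketch of the logistic-regression gradient step. The point is that
# "dw" and "db" are plain Python variable names holding the values of
# dJ/dw and dJ/db -- they are not mathematical notation.
def compute_gradients(w, b, X, Y):
    """w: (n, 1) weights, b: scalar bias, X: (n, m) inputs, Y: (1, m) labels."""
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))      # sigmoid activations, shape (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dw = np.dot(X, (A - Y).T) / m                         # dJ/dw, shape (n, 1)
    db = np.sum(A - Y) / m                                # dJ/db, a scalar
    return dw, db, cost

def gradient_step(w, b, X, Y, learning_rate=0.01):
    dw, db, _ = compute_gradients(w, b, X, Y)
    w = w - learning_rate * dw    # the update consumes the gradient values in dw, db
    b = b - learning_rate * db
    return w, b
```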

Hi @dtonhofer

You raise a valid point, and it’s understandable why it might feel misleading. However, the simplified “dw” notation is widely used in machine learning as shorthand to improve accessibility for those new to calculus.

For those of us who know calculus, though, it shouldn’t be a problem—right? :wink:

1 Like

Thank you for the answer.

I guess I will have to get used to it :grin:

1 Like

The other point to realize as you go through this is that everything here is a derivative of either J or L, right? The question is “with respect to what”? :nerd_face: So just using the notation \nabla J is going to be ambiguous. E.g. what do you call what Prof Ng calls db?

But the higher level point here is that ML notation is not the same as math notation. I also came to this from the math side of the world, so had to adjust a bit. Another example is when they say log here, they always mean natural log.

Prof Ng is the boss here, so he gets to choose his own notation and we just have to deal with it. The real thing to be aware of is that when he says dSomething, you have to be careful to realize whether he means a derivative of J or of L w.r.t. the Something. That’s the real ambiguity. Here’s a thread which discusses that point w.r.t. the factor of \frac {1}{m} that you see in dW and db, but not in other gradients.
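Just to spell that out with the standard logistic regression quantities (a quick sketch, not a quote from the lectures): the per-example quantities are derivatives of the loss L, while the parameter gradients are derivatives of the averaged cost J, which is where the \frac{1}{m} comes from.

```latex
% Per-example quantity: derivative of the loss L, no 1/m factor.
\[
  dz^{(i)} \;=\; \frac{\partial L\!\left(a^{(i)}, y^{(i)}\right)}{\partial z^{(i)}}
           \;=\; a^{(i)} - y^{(i)}
\]
% Parameter gradients: derivatives of J = (1/m) \sum_i L(a^{(i)}, y^{(i)}),
% hence the 1/m factor.
\[
  dw \;=\; \frac{\partial J}{\partial w}
      \;=\; \frac{1}{m}\sum_{i=1}^{m} x^{(i)}\, dz^{(i)},
  \qquad
  db \;=\; \frac{\partial J}{\partial b}
      \;=\; \frac{1}{m}\sum_{i=1}^{m} dz^{(i)}
\]
```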

2 Likes

Thank you.

I have gone over the gradient descent derivation for the 2-input perceptron / 2-d logistic regression on paper and have settled on the trick of “quoting” the dvars (the partial derivatives of L), so that they look more like program variable names. Which, after all, they are. :face_with_monocle:

While doing so, I also got confused for at least 5 minutes when I noticed that I was unsure whether the partial derivatives I was looking at were functions or values. They are values! So I found it expedient to add vertical evaluation bars after the ∂/∂x expressions, with the point at which the partial derivative is to be evaluated written at the bottom. It’s an interesting subtlety that I had initially buried under too many mathematical reflexes and assumptions.
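For example (my own notation, not the course’s):

```latex
% "dw1" is a value, not a function: the partial derivative of the loss,
% evaluated at the current parameters (the current training example's
% x and y are held fixed inside L).
\[
  \texttt{dw1} \;=\;
  \left. \frac{\partial L(w_1, w_2, b)}{\partial w_1}
  \right|_{(w_1,\, w_2,\, b) \,=\, (w_1^{\mathrm{cur}},\, w_2^{\mathrm{cur}},\, b^{\mathrm{cur}})}
\]
```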

Looking good.