# Where does the (1/m) come from mathematically?

J = (1/m) * SUM(L(yhat, y))
dJ/dZ = (1/m) * SUM(dL(yhat, y)/dZ)

Let X = A[l-1]
Z = WX+b
dZ/dW = X

so dJ/dW = dJ/dZ * dZ/dW = dJ/dZ * X, right?
but why did the derivatives include another (1/m) such that dJ/dW = (1/m) * dJ/dZ * X?

Intuitively it makes sense to average it; but mathematically I couldn’t understand why…

Thank you so much for your time and help!

Hi @Nick_He

Welcome to the community.

This is a topic I have not worked on in a while, but I will try to help you out.

So, the additional factor of (1/m) in the derivative expression comes from the chain rule of calculus. Let’s go through the derivation step by step to understand where it comes from.

Given the cost function J, which is defined as the mean of the loss function L over m training examples:

J = (1/m) * SUM(L(yhat, y)) over all training examples

Now, we want to calculate the derivative of J with respect to the weights W. In order to do that, we can use the chain rule of calculus.

Let’s denote dJ/dZ as the derivative of J with respect to the weighted sum Z. That is, dJ/dZ = (1/m) * SUM(dL(yhat, y)/dZ) over all training examples.

Next, we want to find dZ/dW, which represents the derivative of the weighted sum Z with respect to the weights W. Since Z = WX + b, the derivative of Z with respect to W is simply X.

Now, to calculate dJ/dW, we apply the chain rule:

dJ/dW = dJ/dZ * dZ/dW

Substituting the expressions we derived earlier:

dJ/dW = (1/m) * SUM(dL(yhat, y)/dZ) * X

Now, the factor of (1/m) enters the derivative expression because J is defined as the mean of the loss function L over m training examples. When we take the derivative of J with respect to the weights W, we have to account for that averaging: the derivative of an average is the average of the derivatives, and that average carries the (1/m) factor.

Intuitively, the (1/m) factor ensures that the gradients are scaled consistently when performing gradient descent or other optimization algorithms. Without it, the gradient magnitude would grow with the number of examples m, effectively coupling the step size to the batch size and leading to unstable or inefficient learning.

In summary, the (1/m) factor in the derivative expression comes from the definition of the cost function J as the mean of the loss function L over m training examples. It ensures the gradients are properly scaled during the training process.
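To make the derivation concrete, here is a minimal numpy sketch (made-up data, not from the course materials) for a single logistic-regression unit: the per-example dL/dZ values carry no 1/m, and the single 1/m appears when they are averaged into dW. A numerical gradient of J confirms the analytic formula.

```python
import numpy as np

# Minimal sketch (made-up data): logistic regression with m = 4 examples.
# The single 1/m in dW comes from J being the MEAN of the per-example losses.
np.random.seed(0)
m = 4
X = np.random.randn(3, m)                  # 3 features, m examples
Y = (np.random.rand(1, m) > 0.5).astype(float)
W = np.random.randn(1, 3)
b = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W_):
    A = sigmoid(W_ @ X + b)
    # J = (1/m) * SUM(L(yhat, y)): the mean of the per-example losses
    return -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

# Analytic gradient: dZ holds one dL/dZ per example (no 1/m yet);
# the 1/m appears exactly once, when the examples are averaged into dW.
A = sigmoid(W @ X + b)
dZ = A - Y                                 # shape (1, m)
dW = (1.0 / m) * dZ @ X.T                  # dJ/dW = (1/m) * SUM(dL/dZ * x)

# Numerical gradient of J for comparison
eps = 1e-6
dW_num = np.zeros_like(W)
for j in range(W.shape[1]):
    Wp, Wm = W.copy(), W.copy()
    Wp[0, j] += eps
    Wm[0, j] -= eps
    dW_num[0, j] = (cost(Wp) - cost(Wm)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-6))  # → True
```

If the (1/m) were dropped from dW, the comparison would fail by exactly a factor of m.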

Best regards
elirod


Thank you so much elirod! I understand the intuitive part and I completely agree with the result “dJ/dW = (1/m) * SUM(dL(yhat, y)/dZ) * X.”

Just to double check, the (1/m) factor in my question refers to the factor in this expression:

which seems weird because, from what we just derived, dJ/dZ = (1/m) * SUM(dL(yhat, y)/dZ) and dJ/dW = dJ/dZ * dZ/dW = dJ/dZ * X. So the additional (1/m) in the given formula, dJ/dW = (1/m) * dJ/dZ * X, was confusing.

Based on what you’ve said, do you mean that we manually add this additional (1/m) to average the values (which performs better), rather than inheriting it from the derivative expression? Thank you so much again for your time and clarification!


Let me just share this reading item from Course 1 Week 4, and hopefully it can clear up some confusion.

These equations are meant to be used together. If we want to use one of them to compute a gradient, we need to use the rest of them to compute all the other gradients. We can’t just pick one equation for a certain gradient and invent a new equation to compute another.

With the above list of equations, only the equations for dW and db have 1/m, and since neither of those is used to compute another gradient, we don’t apply 1/m more than once to any gradient.
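The rule above can be sketched for a small two-layer network (hypothetical shapes and made-up data, not the course's notebook code): dZ2, dA1, and dZ1 carry per-example derivatives with no 1/m, while each dW and db applies 1/m exactly once. A numerical check on the deeper layer's dW1 verifies that no second 1/m sneaks in along the chain.

```python
import numpy as np

# Hypothetical 2-layer network (ReLU then sigmoid) with made-up shapes;
# the point is only WHERE the single 1/m appears in the backward pass.
np.random.seed(1)
m = 5
A0 = np.random.randn(3, m)                        # input X, 3 features
W1, b1 = np.random.randn(4, 3) * 0.5, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.5, np.zeros((1, 1))
Y = (np.random.rand(1, m) > 0.5).astype(float)

def forward(W1_):
    Z1 = W1_ @ A0 + b1
    A1 = np.maximum(0, Z1)                        # ReLU
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))    # sigmoid
    return Z1, A1, A2

def cost(W1_):
    _, _, A2 = forward(W1_)
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

Z1, A1, A2 = forward(W1)

# Backward pass: dZ2, dA1, dZ1 hold per-example derivatives of L (no 1/m);
# the 1/m is applied exactly once, inside each dW and db.
dZ2 = A2 - Y                                      # dL/dZ2 per example
dW2 = (1.0 / m) * dZ2 @ A1.T                      # 1/m enters here
db2 = (1.0 / m) * np.sum(dZ2, axis=1, keepdims=True)
dA1 = W2.T @ dZ2                                  # no 1/m: still per example
dZ1 = dA1 * (Z1 > 0)                              # ReLU derivative, no 1/m
dW1 = (1.0 / m) * dZ1 @ A0.T                      # 1/m enters once, here
db1 = (1.0 / m) * np.sum(dZ1, axis=1, keepdims=True)

# Numerical check on the deeper layer: if dZ1 had carried an extra 1/m,
# dW1 would be off by a factor of m and this comparison would fail.
eps = 1e-6
dW1_num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW1_num[i, j] = (cost(Wp) - cost(Wm)) / (2 * eps)
print(np.allclose(dW1, dW1_num, atol=1e-6))
```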

Raymond


Thank you so much! Just to confirm, is the idea that only one (1/m) arises from the cost function, but we can choose where to apply it? Here, since dW[l] and db[l] are at the end of the chain rule, we can safely apply (1/m) to each. But a term like dZ[l-1] is still being chained further, so we save the (1/m) for its dW[l-1] and db[l-1], and so on. This makes sense. It seems like I was interpreting the definition too strictly?

@Nick_He

Yes. This is the idea.

I would interpret

• dZ^{[L]} as \frac{\partial{L}}{\partial{Z^{[L]}}}, and
• dW^{[L]} as \frac{\partial{J}}{\partial{W^{[L]}}}, and
• db^{[L]} as \frac{\partial{J}}{\partial{b^{[L]}}}

Note the difference between L and J in the “numerators”. See if that makes sense to you too?

Note that L is the loss of a sample. J is the cost over all samples.
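The L-vs-J distinction can be checked numerically with a few made-up numbers (not from the course): for a sigmoid output with cross-entropy loss, dZ = A - Y is the derivative of each example's own loss L with respect to its own z, so no 1/m is involved at that step.

```python
import numpy as np

# Made-up numbers: for a sigmoid output with cross-entropy loss,
# dZ = A - Y is dL/dZ for each example SEPARATELY, so no 1/m appears.
z = np.array([[0.5, -1.2, 2.0]])           # one Z value per example
y = np.array([[1.0, 0.0, 1.0]])
a = 1.0 / (1.0 + np.exp(-z))

def loss(z_, y_):
    # L for a single example: no sum, no averaging over m
    a_ = 1.0 / (1.0 + np.exp(-z_))
    return -(y_ * np.log(a_) + (1 - y_) * np.log(1 - a_))

# Central-difference derivative of each example's OWN loss w.r.t. its z
eps = 1e-6
dZ_num = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)
print(np.allclose(a - y, dZ_num, atol=1e-6))  # → True
```

The 1/m then enters only when these per-example quantities are averaged into dW = ∂J/∂W and db = ∂J/∂b.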

Cheers,
Raymond


This makes a ton of sense. Thank you so much for your time and help!
