C4_W1_A1 dW and db calculation question

younglee · March 8, 2023, 4:11am

When computing dW and db in this convolutional layer, why do we only compute the summations and not divide them by the number of examples m?

In C1_W3_A1, for a plain NN, we divided the summations by m to get the final dW and db. I wonder why we do it differently here in C4_W1_A1.

TMosh · March 8, 2023, 5:13am

Since ‘m’ is a constant, it doesn’t matter whether it’s included or not. It’s just a scaling factor. It has no impact on which weight values will give the minimum cost.
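To make the "scaling factor" point concrete, here is a minimal sketch (not taken from the assignment). It assumes a hypothetical one-parameter least-squares problem and compares the summed cost with the same cost divided by m over a grid of candidate weights. The data, the grid, and all variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy data: y is roughly 3 * x (an assumption for illustration only).
rng = np.random.default_rng(0)
m = 200
x = rng.normal(size=m)
y = 3.0 * x + rng.normal(scale=0.1, size=m)

# Candidate weights to evaluate.
w_grid = np.linspace(0.0, 6.0, 601)

# residuals[i, k] is the error of candidate w_grid[i] on example k.
residuals = w_grid[:, None] * x[None, :] - y[None, :]

cost_sum = 0.5 * np.sum(residuals ** 2, axis=1)  # summed over examples, no 1/m
cost_mean = cost_sum / m                         # same cost, scaled by 1/m

# Dividing by a positive constant rescales the curve but does not move its minimum.
best_w_sum = w_grid[np.argmin(cost_sum)]
best_w_mean = w_grid[np.argmin(cost_mean)]
print(best_w_sum, best_w_mean)
```

Both versions pick the same weight; only the height of the cost curve (and therefore the size of its gradient) differs.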

younglee · March 8, 2023, 6:50am

Thank you very much for the answer. But I still cannot clearly understand this change in scaling, mainly because the scaling of dA_prev still looks consistent with the Course 1 materials. I would really appreciate it if you could clarify the additional questions below.

My question rests on the assumption that, between Course 1 and Course 4, the scaling has changed only for dW and db, not for dA_prev. So maybe my first question is: is that the case?

If so, wouldn’t changing the scale only for dW and db change the path and the speed at which we find the optimal weights, especially when m is very large? Is scaling not a concern because we usually use a more advanced optimization method like Adam? Or is it still not a concern even if we use basic batch gradient descent with a single fixed learning rate?

rmwkwok · March 8, 2023, 1:03pm

Hello @younglee

dA_prev is a per-sample variable, so it is not scaled by m. It is true that the C1W3A1 assignment requires dW and db to be divided by m. In C4W1A1, however, there is no such requirement. This can be verified from the exercise’s description.

Let’s focus on C1W3A1 first. If we check all of the formulas, we find that the “origin” or “source” of the 1/m is actually the cost function. We are just passing this 1/m down through back-prop until it finally shows up in dW and db.
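A quick numerical check of where the 1/m comes from is sketched below. It assumes a single linear layer and a squared-error loss averaged over m examples (the course assignment uses a different network and loss, so this is only an illustration), and compares the analytic gradient against a finite-difference gradient of the averaged cost. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_x = 5, 3
X = rng.normal(size=(n_x, m))   # columns are examples, as in the Course 1 notation
Y = rng.normal(size=(1, m))
W = rng.normal(size=(1, n_x))
b = np.zeros((1, 1))

def cost(W, b):
    """Assumed cost: squared error averaged over the m examples."""
    Z = W @ X + b
    return np.sum(0.5 * (Z - Y) ** 2) / m

# Per-example gradient of the loss with respect to Z; no 1/m appears here.
dZ = (W @ X + b) - Y

dW_sum = dZ @ X.T                               # summed over examples
dW = dW_sum / m                                 # the 1/m inherited from the cost
db = np.sum(dZ, axis=1, keepdims=True) / m

# Central finite differences on the averaged cost.
eps = 1e-6
dW_num = np.zeros_like(W)
for j in range(n_x):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, j] += eps
    W_minus[0, j] -= eps
    dW_num[0, j] = (cost(W_plus, b) - cost(W_minus, b)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-6))       # gradient with 1/m matches the cost
print(np.allclose(dW_sum, dW_num, atol=1e-6))   # summed gradient is off by a factor of m
```

If the `/ m` were removed from `cost`, the summed gradient would be the matching one instead, which is the sense in which the factor belongs to the cost function rather than to any particular layer.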

The same idea should apply to C4W1A1, even though this time the assignment does not show (or use) any cost function, because what it builds is not a complete neural network (C1W3A1 does build a complete NN).

Therefore, when we implement a complete convolutional NN, dW and db should carry a 1/m that comes from the cost function. I hope this point is clear now.

As Tom has explained, 1/m can be viewed as a scaling factor. If dW and db are not divided by m, then we need to adjust another scaling factor, the learning rate, to compensate. Therefore, it is a concern, and it is the same concern whether or not we use Adam.
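The trade-off between the 1/m factor and the learning rate can be sketched for plain batch gradient descent as follows. This is a hypothetical one-feature linear regression, not the assignment's code; the function `run_gd` and its parameters are assumptions made for illustration, and the sketch says nothing about Adam.

```python
import numpy as np

# Hypothetical noiseless data: y = 2 * x + 1.
rng = np.random.default_rng(2)
m = 1000
x = rng.normal(size=m)
y = 2.0 * x + 1.0

def run_gd(lr, divide_by_m, steps=100):
    """Plain batch gradient descent on a squared-error cost."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = w * x + b - y
        dw = np.sum(err * x)   # summed over examples
        db = np.sum(err)
        if divide_by_m:
            dw /= m
            db /= m
        w -= lr * dw
        b -= lr * db
    return w, b

# Averaged gradients with learning rate 0.1.
w_mean, b_mean = run_gd(lr=0.1, divide_by_m=True)

# Summed gradients with the learning rate shrunk by m: the same updates.
w_sum, b_sum = run_gd(lr=0.1 / m, divide_by_m=False)

# Summed gradients with the unadjusted learning rate: the steps are m times too large.
w_bad, b_bad = run_gd(lr=0.1, divide_by_m=False, steps=20)

print(w_mean, w_sum, w_bad)
```

In this sketch the first two runs land on essentially the same weights, while the third overshoots and diverges, which is why dropping the 1/m only works if the learning rate is rescaled to match.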

However, as I said above, in a complete CNN the 1/m should be passed down to each dW and db, so the learning rate does not need to be adjusted to account for it.

Cheers,
Raymond

younglee · March 9, 2023, 12:11am

I really appreciate your detailed explanation, Raymond! Now I better understand what’s going on.

And I think I now understand Tom’s comment. Because the only parameters here are W and b, applying the same scaling factor to dW and db would not change the solution, as long as the learning rate compensates for it. That makes sense. Thank you, Tom!

Thank you very much!
