C2W1 - Initialization and backward_propagation() in init_utils.py

In the first C2W1 assignment (Initialization), in section 3, the model() function calls an auxiliary function backward_propagation(), which is implemented in the auxiliary file init_utils.py.
I opened that file to check the code inside that function and I have a question.

The general equations for backpropagation (vectorized form for m samples) say that:
dW = (1/m)*np.dot(dZ, previous_A.T)

Why does the division by m not happen for dW2 and dW1, but is applied only at dW3 (via dZ3)?

Hello @Charalampos_Inglezos

For example, dW2 also carries that 1/m: it is inherited from dz3 through da2 and dz2.
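To make that chain explicit, here is a sketch of the derivation. It assumes, as described in this thread, that the script sets dz3 = (1/m)(a3 - Y). The tilde marks quantities that already contain the 1/m factor, and g' stands for the derivative of whatever hidden-layer activation is used; both are notation introduced here for illustration, not taken from the course.

```latex
\begin{aligned}
\widetilde{dZ}^{[3]} &= \tfrac{1}{m}\,\bigl(A^{[3]} - Y\bigr) = \tfrac{1}{m}\,dZ^{[3]} \\
dW^{[3]} &= \widetilde{dZ}^{[3]} A^{[2]T} = \tfrac{1}{m}\,dZ^{[3]} A^{[2]T} \\
\widetilde{dA}^{[2]} &= W^{[3]T}\,\widetilde{dZ}^{[3]} = \tfrac{1}{m}\,dA^{[2]} \\
\widetilde{dZ}^{[2]} &= \widetilde{dA}^{[2]} \odot g'\bigl(Z^{[2]}\bigr) = \tfrac{1}{m}\,dZ^{[2]} \\
dW^{[2]} &= \widetilde{dZ}^{[2]} A^{[1]T} = \tfrac{1}{m}\,dZ^{[2]} A^{[1]T}
\end{aligned}
```

Every step is linear in the upstream gradient, so a scalar factor applied once at the output layer is carried unchanged into every earlier dW and db.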

Cheers,
Raymond

Not exactly. The formulas say that we have to divide by m at each dW[l], not just once at the start of backpropagation, at layer L.
If you continue to C2W1 assignment 2, you can see in the graded exercises that the division by m is applied at every dW.

It is exactly the case in the piece of code that you have shared :wink: The code is simply not implemented in exactly the way the formulae are stated, but it still works.

We are discussing the code you have shared, right?

Can you see this in the code or not?

Cheers,
Raymond

On the other hand, you could also remove that 1/m from dz3 and apply it to each dW and db instead, so that the code becomes consistent with the formulae. However, I think your following statement:

actually showed your correct understanding of the situation even though the same thing was done in a different way, which is actually excellent!

Cheers,
Raymond

Yes, I can see that, but I am not sure this corresponds to the formulas shown. For example, theory says dA_prev = np.dot(W.T, dZ) with no division by m, but if we follow the code in the picture, da2 contains 1/m from the first dz3 term, which is not the formula from theory.

As I said, the code is not implemented in the same way as the formulae but the code is not wrong.

Shall we focus first on whether they deliver correct results? Or are you more concerned about the way the code was implemented? If it is the latter, feel free to change it back to the way the formulae state. I think it is also a good exercise to verify whether the change produces the same set of results or not :wink:
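A minimal sketch of such a check is below. It is not the code from init_utils.py: the 3-layer ReLU/ReLU/sigmoid network, the layer sizes, and the function names are all hypothetical choices made for illustration. It only compares the two placements of 1/m on the same forward pass.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    # Hypothetical 3-layer network: ReLU -> ReLU -> sigmoid
    W1, b1, W2, b2, W3, b3 = params
    a1 = relu(W1 @ X + b1)
    a2 = relu(W2 @ a1 + b2)
    a3 = sigmoid(W3 @ a2 + b3)
    return a1, a2, a3

def backward_scaled_once(X, Y, params, acts):
    """1/m folded into dz3 once, as described in this thread."""
    _, _, W2, _, W3, _ = params
    a1, a2, a3 = acts
    m = X.shape[1]
    dz3 = (a3 - Y) / m
    dW3 = dz3 @ a2.T
    dz2 = (W3.T @ dz3) * (a2 > 0)   # ReLU derivative as a 0/1 mask
    dW2 = dz2 @ a1.T
    dz1 = (W2.T @ dz2) * (a1 > 0)
    dW1 = dz1 @ X.T
    return dW1, dW2, dW3

def backward_per_layer(X, Y, params, acts):
    """1/m applied at every dW, as in the lecture formulas."""
    _, _, W2, _, W3, _ = params
    a1, a2, a3 = acts
    m = X.shape[1]
    dz3 = a3 - Y
    dW3 = (dz3 @ a2.T) / m
    dz2 = (W3.T @ dz3) * (a2 > 0)
    dW2 = (dz2 @ a1.T) / m
    dz1 = (W2.T @ dz2) * (a1 > 0)
    dW1 = (dz1 @ X.T) / m
    return dW1, dW2, dW3

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 50))                  # 2 features, m = 50 samples
Y = (rng.random((1, 50)) > 0.5).astype(float)     # random binary labels
params = (
    rng.standard_normal((10, 2)), np.zeros((10, 1)),
    rng.standard_normal((5, 10)), np.zeros((5, 1)),
    rng.standard_normal((1, 5)), np.zeros((1, 1)),
)

acts = forward(X, params)
grads_once = backward_scaled_once(X, Y, params, acts)
grads_each = backward_per_layer(X, Y, params, acts)

# Compare dW1, dW2, dW3 from both variants up to floating-point tolerance
all_match = all(np.allclose(g1, g2) for g1, g2 in zip(grads_once, grads_each))
print("Gradients match:", all_match)
```

The db terms are left out to keep the sketch short; the same comparison applies to them, since they are sums of the same dz arrays.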

Cheers,
Raymond

I don’t doubt it, I’m sure the code will work as it is.
As a student, though, I would expect consistency between theory and practice, since each line of code shows intention. As a silly example, instead of λ/2m it could be 4λ/8m, and everybody would scratch their heads wondering where that came from. But yeah, of course 4λ/8m will work absolutely fine and give the same results!
Anyway, a simple mention of that alternative in the video lectures would suffice.

Hello @Charalampos_Inglezos,

Thank you for sharing :smiley: ! Indeed, I would scratch my head too! And I can see that you were suggesting that the formulae be implemented exactly as stated in the lecture.

Here is my view. In my experience, variations don’t just happen in a utility script; they can also show up in a quiz, in online references, in a codebase at the workplace, and in many other places. At the end of the day, it is our understanding and our efforts to verify that will get us through.

I think this discussion is an example of how, with some effort, the unclear can become clear. In my opinion, it is not a bad process at all: the difference should be manageable; a lab environment is where we are supposed to get our hands dirty, try things, and verify them; it represents variations that can happen from time to time in our careers; there is a version in the assignment that sticks to the formula; and we now have this discussion as an example.

I know this is just one example for communicating the potential confusion, so we probably shouldn’t go too far into it.

Changing 1/2 to 4/8 in that situation is definitely nonsense, and actually, learners sometimes also find the 2 in the denominator confusing.

However, we want to talk about why the 2 is added there, right? And we probably want to think about whether putting that 1/m there actually makes any difference to the execution of the code. As a learner myself, I think this is something I can ask about, and indeed I had thought about it as I read your question!

The factor of 2 in the denominator cancels out the other factor of 2 in the numerator induced by taking the derivative of the mean squared loss. Such cancellation also saves our computer’s time from computing that unnecessary factor.
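Written out, the cancellation looks like this. The first line uses a squared-error cost with a 1/2 factor; the second line shows the same effect for an L2 penalty of the λ/2m form mentioned earlier in this thread. The symbols follow common convention and are assumptions here, not quoted from the course materials.

```latex
\begin{aligned}
J &= \frac{1}{2m}\sum_{i=1}^{m}\bigl(\hat{y}^{(i)} - y^{(i)}\bigr)^{2}
  &&\Longrightarrow\quad
  \frac{\partial J}{\partial \hat{y}^{(i)}} = \frac{1}{m}\bigl(\hat{y}^{(i)} - y^{(i)}\bigr) \\
R &= \frac{\lambda}{2m}\,\lVert W \rVert_F^{2}
  &&\Longrightarrow\quad
  \frac{\partial R}{\partial W} = \frac{\lambda}{m}\,W
\end{aligned}
```

In both cases the 2 produced by differentiating the square cancels the 1/2, so the gradient needs no extra constant.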

Putting the 1/m there in that script can also save our computer’s time, because it will only need to do that computation once. However, the actual amount of time that can be saved will be our own job to verify.

That amount of time saved will definitely scale with the number of rounds of gradient descent.
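One rough way to measure it is sketched below. The layer widths, batch size, and the backward() helper are hypothetical, and random arrays stand in for real activations because only the shapes matter for timing. Results will vary by machine and by layer sizes, so this says nothing about which variant is faster in the actual assignment.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
m = 2000                    # hypothetical number of samples
sizes = [20, 10, 5, 1]      # hypothetical widths: input, two hidden layers, output

# Ws[l] maps layer l to layer l + 1; A[l] stands in for the activations of layer l
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
A = [rng.random((n, m)) for n in sizes]
Y = (rng.random((1, m)) > 0.5).astype(float)

def backward(scale_once):
    """Return [dW3, dW2, dW1], with 1/m applied once at dz3 or at every dW."""
    dz = A[3] - Y
    if scale_once:
        dz = dz / m
    grads = []
    for l in (2, 1, 0):
        dW = dz @ A[l].T
        grads.append(dW if scale_once else dW / m)
        if l > 0:
            # Propagate to the previous layer, using a ReLU-style 0/1 mask
            dz = (Ws[l].T @ dz) * (A[l] > 0)
    return grads

t_once = timeit.timeit(lambda: backward(True), number=200)
t_each = timeit.timeit(lambda: backward(False), number=200)
print(f"1/m applied once at dz3: {t_once:.4f} s for 200 passes")
print(f"1/m applied at every dW: {t_each:.4f} s for 200 passes")

# Both variants should give the same gradients regardless of timing
same = all(np.allclose(a, b) for a, b in zip(backward(True), backward(False)))
print("Same gradients:", same)
```

Note that scaling dz3 touches one value per sample, while scaling each dW touches one value per weight, so which variant does less arithmetic depends on how the batch size compares with the layer widths.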

I will scratch my head when I see a variation. However, it is very likely that I will implement a variation myself if I have other purposes in mind, which can include code optimization. As a learner myself, I think that experience is invaluable.

Maybe everything I have said here was not the intention from the person who had prepared that piece of code, but I do think it is not a bad coincidence and it is not a bad learning opportunity.

@Charalampos_Inglezos, I guess not all learners will go over the utility scripts, but also as a learner, I am glad that you have :slight_smile:

Cheers,
Raymond

Yes I agree, sometimes we have to dig into the code and verify ourselves what works and what does not…
Thank you for the conversation :slightly_smiling_face:

You are very welcome, @Charalampos_Inglezos :slight_smile: