Should there be a rot180 on the filter when calculating dA_prev?

Hi Buddies,

After reviewing the math behind CNNs, I began to suspect there might be something wrong with the instruction that says “da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += W[:,:,:,c] * dZ[i, h, w, c]”.
Should there actually be a rot180 operator on the filter, i.e. something like np.rot90(W, 2)? Because, mathematically, to calculate dA_prev we need to convolve the padded delta(l) with the filter rotated 180 degrees, not with the filter itself.
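For concreteness, here is a minimal sketch of what I mean, assuming a single example, a single channel, stride 1 and no padding, with made-up shapes. scipy's convolve2d flips the kernel internally, so a “full” convolution of dZ with W is the same thing as sliding the 180-degree-rotated filter over the zero-padded dZ:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))    # one filter (made-up size)
dZ = rng.standard_normal((4, 4))   # upstream gradient of the conv output (made-up size)

# Textbook form: slide the 180-degree-rotated filter over the (implicitly
# zero-padded) dZ, i.e. a "full" cross-correlation with rot180(W).
dA_rot180 = correlate2d(dZ, np.rot90(W, 2), mode="full")

# Equivalent form: a true "full" convolution with W, since convolution
# already flips the kernel.
dA_conv = convolve2d(dZ, W, mode="full")

print(np.allclose(dA_rot180, dA_conv))  # True
```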
However, since the test only checks mean(dA), I pass it either way XD.
Am I misunderstanding something here?

Many thanks


Equation from another source

rot90 and rot180 are geometric operations. What does that have to do with computing and applying gradients? Do you mean “transpose”? That’s a completely different thing.

Prof Ng has specifically designed these courses not to require any knowledge even of univariate calculus, let alone higher dimensional and matrix calculus. So any questions about the derivations of the formulas given to us are beyond the scope of this course. Here’s a thread that has links to some material on the web that goes into the math behind all this. (Please note that thread is linked from one of the topics on the DLS FAQ Thread, which is also worth a look if you haven’t seen it yet).

The one high level thing worth pointing out is that Prof Ng has done one sort of “simplification” in the way he presents the results of the math: he uses the convention that the gradient of an object and the base object are the same shape. If you really take the math literally and carefully, it turns out that the gradient of an object has the shape of the transpose of the base object. But for the way we are applying the gradients, it’s simpler to formulate things the way Prof Ng does. If you’re not actually doing the underlying math, his way is just simpler and cleaner. I’m guessing that is the real point that answers your question here.

Thanks Paul

But I don't think my confusion is resolved yet. To clarify my question, please allow me to take a bit more space to cite the example I've seen. Suppose we have the activation of the (l-1)th layer, which can be written as:
[image]
and its corresponding filter
[image] (as you can see, in this example the filter is not rotated here)
so in forward propagation we get a [image] after the convolution operation with valid padding, stride 1 and zero bias, where the matrix elements denote:
[image]
Then in backward propagation, suppose we got
[image] on the lth layer, and by the chain rule we have a relation like
[image] and so the matrix of gradients shall be
[image]
which is equivalent to the result of the padded delta(l) convolved with the rotated filter W, which looks like this
[image]

where the filter matrix W has been rotated 180 degrees (and yes, rotated, not transposed).
So here is where my confusion came from: in the assignment we convolve dZ with W (unrotated) to get dA_prev, which would give a different result. (Although both methods pass the test, as the assertion only checks the mean of dA.) Is the example I posted wrong?
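Since the images may not render here, let me add a small numeric stand-in for that example; the 3x3 activation, 2x2 filter and dZ values below are placeholders I made up, not the actual figures. It applies the chain rule element by element and compares the result with the padded delta(l) correlated with the rotated filter:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

A = np.arange(9, dtype=float).reshape(3, 3)   # activation of layer l-1 (made-up 3x3)
W = np.array([[1., 2.],
              [3., 4.]])                       # 2x2 filter, used as-is in forward prop

# Forward prop: valid "convolution" (really cross-correlation), stride 1, zero bias
Z = correlate2d(A, W, mode="valid")            # 2x2 output

# Suppose backprop hands us delta(l) = dZ (values made up)
dZ = np.array([[0.1, 0.2],
               [0.3, 0.4]])

# Chain rule, element by element: A[h+p, w+q] contributed to Z[h, w] through
# W[p, q], so dA[h+p, w+q] picks up W[p, q] * dZ[h, w].
dA_chain = np.zeros_like(A)
for h in range(2):
    for w in range(2):
        for p in range(2):
            for q in range(2):
                dA_chain[h + p, w + q] += W[p, q] * dZ[h, w]

# The same thing as one operation: pad delta(l) and slide the 180-degree-rotated
# filter over it ("full" correlation with rot180(W)).
dA_rot180 = correlate2d(dZ, np.rot90(W, 2), mode="full")

print(np.allclose(dA_chain, dA_rot180))                    # True
print(np.allclose(dA_chain, convolve2d(dZ, W, "full")))    # True
```

Both checks come out True, which is the equivalence the example is claiming.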

Thanks again for your help

Sorry, but I am not the right person to answer this question. Maybe some of the other mentors or fellow students will be interested in spending time on this.

Thanks for your attention anyway.

Sorry, but I do not completely understand your question.
However, no rotation of the matrices is needed.

Hello Mentors,

After another check, I think I've got it.
In the instruction “W[:,:,:,c] * dZ[i, h, w, c]”, the operator is an elementwise multiply,
while in that example, the operator is a convolution,
and both methods give me the correct gradients.
My confusion probably came from the different meanings of the “*” operator and a lack of practice.

Many thanks

Yes, that is an elementwise multiply. But also note that it is the fundamental operation in convolution: it just gets repeated across the input using the stride. And what we are doing here is the backward equivalent of that “atomic” operation to apply the gradients. The scale of the operation is the same in both cases.
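To make that concrete, here is a minimal sketch of the idea, assuming a single example, a single channel, stride 1 and no padding (a simplification of the full loops over i, h, w and c in the assignment). Repeating the notebook's “scale the filter by dZ[h, w] and add it into its window” step across the input reproduces exactly the rotated-filter convolution from the example above:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(1)
f = 3
A_prev = rng.standard_normal((6, 6))   # made-up activation of the previous layer
W = rng.standard_normal((f, f))        # one filter
n_H, n_W = A_prev.shape[0] - f + 1, A_prev.shape[1] - f + 1
dZ = rng.standard_normal((n_H, n_W))   # made-up upstream gradient

# Assignment-style backward loop: for each output position, scale the filter
# by dZ[h, w] (an elementwise multiply) and add it into the window it came from.
dA_prev = np.zeros_like(A_prev)
for h in range(n_H):
    for w in range(n_W):
        dA_prev[h:h + f, w:w + f] += W * dZ[h, w]

# The rot180 formulation from the example: "full" correlation of dZ with the
# 180-degree-rotated filter.
dA_rot180 = correlate2d(dZ, np.rot90(W, 2), mode="full")

print(np.allclose(dA_prev, dA_rot180))  # True
```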