Should there be a rot180 on the filter when calculating dA_prev?

Hi Buddies,

After reviewing the math behind CNNs, I began to suspect there might be something wrong with the instruction that says “da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += W[:,:,:,c] * dZ[i, h, w, c]”.
Should there actually be a rot180 operator on the filter, i.e. something like np.rot90(W, 2)? Because, mathematically, to calculate dA_prev we need to convolve the padded delta(l) with the filter rotated 180 degrees, not with the filter itself.
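For concreteness, here is a minimal sketch of what I mean, assuming a single example, a single channel, stride 1 and no padding, with made-up shapes. scipy's convolve2d flips the kernel internally, so a “full” convolution of dZ with W is the same thing as sliding the 180-degree-rotated filter over the zero-padded dZ:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))    # one filter (made-up size)
dZ = rng.standard_normal((4, 4))   # upstream gradient of the conv output (made-up size)

# Textbook form: slide the 180-degree-rotated filter over the (implicitly
# zero-padded) dZ, i.e. a "full" cross-correlation with rot180(W).
dA_rot180 = correlate2d(dZ, np.rot90(W, 2), mode="full")

# Equivalent form: a true "full" convolution with W, since convolution
# already flips the kernel.
dA_conv = convolve2d(dZ, W, mode="full")

print(np.allclose(dA_rot180, dA_conv))  # True
```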
However, since the test only checks mean(dA), I pass it either way XD.
Am I misunderstanding something here?

Many thanks


Equation from another source

rot90 and rot180 are geometric operations. What does that have to do with computing and applying gradients? Do you mean “transpose”? That’s a completely different thing.

Prof Ng has specifically designed these courses not to require any knowledge even of univariate calculus, let alone higher dimensional and matrix calculus. So any questions about the derivations of the formulas given to us are beyond the scope of this course. Here’s a thread that has links to some material on the web that goes into the math behind all this. (Please note that thread is linked from one of the topics on the DLS FAQ Thread, which is also worth a look if you haven’t seen it yet).

The one high level thing worth pointing out is that Prof Ng has done one sort of “simplification” in the way he presents the results of the math: he uses the convention that the gradient of an object and the base object are the same shape. If you really take the math literally and carefully, it turns out that the gradient of an object has the shape of the transpose of the base object. But for the way we are applying the gradients, it’s simpler to formulate things the way Prof Ng does. If you’re not actually doing the underlying math, his way is just simpler and cleaner. I’m guessing that is the real point that answers your question here.

Thanks Paul

But I don't think my confusion is resolved yet. To clarify my question, please allow me to take a bit more space to cite the example I've seen. Suppose we have the activation of the (l-1)th layer, which can be written as:
[image]
and its corresponding filter
[image] (as you can see, in this example the filter is not rotated here)
so in forward propagation we get a [image] after the convolution operation with valid padding, stride 1 and zero bias, where the matrix elements denote:
[image]
Then in backward propagation, suppose we got
[image] on the lth layer, and by the chain rule we have a relation like
[image] and so the matrix of gradients shall be
[image]
which is equivalent to the result of the padded delta(l) convolved with the rotated filter W, which looks like this
[image]

where the filter matrix W has been rotated 180 degrees (and yes, rotated, not transposed).
So here is where my confusion came from: in the assignment we convolve dZ with W (unrotated) to get dA_prev, which would give a different result. (Although both methods pass the test, as the assertion only checks the mean of dA.) Is the example I posted wrong?
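Since the images may not render here, let me add a small numeric stand-in for that example; the 3x3 activation, 2x2 filter and dZ values below are placeholders I made up, not the actual figures. It applies the chain rule element by element and compares the result with the padded delta(l) correlated with the rotated filter:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

A = np.arange(9, dtype=float).reshape(3, 3)   # activation of layer l-1 (made-up 3x3)
W = np.array([[1., 2.],
              [3., 4.]])                       # 2x2 filter, used as-is in forward prop

# Forward prop: valid "convolution" (really cross-correlation), stride 1, zero bias
Z = correlate2d(A, W, mode="valid")            # 2x2 output

# Suppose backprop hands us delta(l) = dZ (values made up)
dZ = np.array([[0.1, 0.2],
               [0.3, 0.4]])

# Chain rule, element by element: A[h+p, w+q] contributed to Z[h, w] through
# W[p, q], so dA[h+p, w+q] picks up W[p, q] * dZ[h, w].
dA_chain = np.zeros_like(A)
for h in range(2):
    for w in range(2):
        for p in range(2):
            for q in range(2):
                dA_chain[h + p, w + q] += W[p, q] * dZ[h, w]

# The same thing as one operation: pad delta(l) and slide the 180-degree-rotated
# filter over it ("full" correlation with rot180(W)).
dA_rot180 = correlate2d(dZ, np.rot90(W, 2), mode="full")

print(np.allclose(dA_chain, dA_rot180))                    # True
print(np.allclose(dA_chain, convolve2d(dZ, W, "full")))    # True
```

Both checks come out True, which is the equivalence the example is claiming.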

Thanks again for your help

Sorry, but I am not the right person to answer this question. Maybe some of the other mentors or fellow students will be interested in spending time on this.

Thanks for your attention anyway.

Sorry, but I do not completely understand your question.
However, no rotation of the matrices is needed.

Hello Mentors,

After another check, I think I've got it.
In the instruction “W[:,:,:,c] * dZ[i, h, w, c]”, the operator is an elementwise multiply,
while in that example, the operator is a convolution,
and both methods give me the correct gradients.
My confusion probably came from the different meanings of the “*” operator and a lack of practice.

Many thanks

Yes, that is an elementwise multiply. But also note that it is the fundamental operation in convolution: it just gets repeated across the input using the stride. And what we are doing here is the backward equivalent of that “atomic” operation to apply the gradients. The scale of the operation is the same in both cases.
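To make that concrete, here is a minimal sketch of the idea, assuming a single example, a single channel, stride 1 and no padding (a simplification of the full loops over i, h, w and c in the assignment). Repeating the notebook's “scale the filter by dZ[h, w] and add it into its window” step across the input reproduces exactly the rotated-filter convolution from the example above:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(1)
f = 3
A_prev = rng.standard_normal((6, 6))   # made-up activation of the previous layer
W = rng.standard_normal((f, f))        # one filter
n_H, n_W = A_prev.shape[0] - f + 1, A_prev.shape[1] - f + 1
dZ = rng.standard_normal((n_H, n_W))   # made-up upstream gradient

# Assignment-style backward loop: for each output position, scale the filter
# by dZ[h, w] (an elementwise multiply) and add it into the window it came from.
dA_prev = np.zeros_like(A_prev)
for h in range(n_H):
    for w in range(n_W):
        dA_prev[h:h + f, w:w + f] += W * dZ[h, w]

# The rot180 formulation from the example: "full" correlation of dZ with the
# 180-degree-rotated filter.
dA_rot180 = correlate2d(dZ, np.rot90(W, 2), mode="full")

print(np.allclose(dA_prev, dA_rot180))  # True
```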