After reviewing the math behind CNNs, I've started to think there may be something wrong with the instruction that says “da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += W[:,:,:,c] * dZ[i, h, w, c]”.
Shouldn't there be a rot180 operation on the filter, something like np.rot90(W, 2)? In the math, to compute dA_prev we need to convolve the padded delta(l) with the filter rotated 180 degrees, not with the filter itself.
However, since the test only checks mean(dA), I pass it anyway XD.
Am I misunderstanding something here?
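For what it's worth, the two views are not actually in conflict. Here is a minimal sketch (single channel, stride 1, no padding on the forward pass, all names are my own, not from the assignment) showing that the course's scatter-add loop produces the same dA_prev as a “full” cross-correlation of zero-padded dZ with the 180-degree-rotated filter:

```python
import numpy as np

rng = np.random.default_rng(0)
f = 3                                   # filter size (assumed square)
A_prev = rng.standard_normal((5, 5))    # single channel, stride 1, no pad
W = rng.standard_normal((f, f))
dZ = rng.standard_normal((3, 3))        # "valid" output of a 5x5 input with a 3x3 filter

# Course-style backward pass: scatter-add the filter scaled by each dZ entry
dA_loop = np.zeros_like(A_prev)
for h in range(dZ.shape[0]):
    for w in range(dZ.shape[1]):
        dA_loop[h:h+f, w:w+f] += W * dZ[h, w]

# Textbook view: "full" cross-correlation of zero-padded dZ with rot180(W)
W_rot = np.rot90(W, 2)
dZ_pad = np.pad(dZ, f - 1)
dA_conv = np.zeros_like(A_prev)
for m in range(A_prev.shape[0]):
    for n in range(A_prev.shape[1]):
        dA_conv[m, n] = np.sum(dZ_pad[m:m+f, n:n+f] * W_rot)

print(np.allclose(dA_loop, dA_conv))  # True: both formulations agree
```

So the rot180 is implicit in the loop: sliding the un-rotated filter and accumulating is algebraically the same operation as convolving with the rotated filter.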
rot90 and rot180 are geometric operations. What does that have to do with computing and applying gradients? Do you mean “transpose”? That’s a completely different thing.
Prof Ng has specifically designed these courses not to require any knowledge even of univariate calculus, let alone higher dimensional and matrix calculus. So any questions about the derivations of the formulas given to us are beyond the scope of this course. Here’s a thread that has links to some material on the web that goes into the math behind all this. (Please note that thread is linked from one of the topics on the DLS FAQ Thread, which is also worth a look if you haven’t seen it yet).
The one high level thing worth pointing out is that Prof Ng has done one sort of “simplification” in the way he presents the results of the math: he uses the convention that the gradient of an object and the base object are the same shape. If you really take the math literally and carefully, it turns out that the gradient of an object has the shape of the transpose of the base object. But for the way we are applying the gradients, it’s simpler to formulate things the way Prof Ng does. If you’re not actually doing the underlying math, his way is just simpler and cleaner. I’m guessing that is the real point that answers your question here.
Sorry, but I am not the right person to answer this question. Maybe some of the other mentors or other fellow students will be interested to spend time on this.
Thanks for your attention anyway.
Sorry, but I do not completely understand your question.
However, no rotation of the matrices is needed.
After another check, I think I've got it.
In the instruction “W[:,:,:,c] * dZ[i, h, w, c]”, the operator is an elementwise multiply,
while in that example, the operator is a convolution,
and both methods give me the correct gradients.
My confusion probably came from the different definitions of the “*” operator and my lack of practice.
Yes, that is an elementwise multiply. But also note that it is the fundamental operation in convolution: it just gets repeated across the input using the stride. And what we are doing here is the backwards equivalent of that “atomic” operation to apply the gradients. The scale of the operation is the same in both cases.
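To make that “atomic operation” concrete, here is a small sketch (variable names are my own, not from the assignment) of the forward step for one output position and its backward equivalent:

```python
import numpy as np

rng = np.random.default_rng(1)
f = 3
a_slice = rng.standard_normal((f, f))   # one receptive field of the input
W = rng.standard_normal((f, f))         # one filter, single channel

# Forward "atomic" step: elementwise multiply, then sum to one output scalar
z = np.sum(a_slice * W)

# Backward equivalent: the scalar gradient dz is broadcast back through the
# same elementwise multiply, one slice at a time
dz = 2.0                   # stand-in for dZ[i, h, w, c]
da_slice = W * dz          # contribution to dA_prev at this position
dW_contrib = a_slice * dz  # contribution to dW from this position

print(da_slice.shape, dW_contrib.shape)  # (3, 3) (3, 3): both match the filter shape
```

Repeating this forward step across the input with the stride is exactly the convolution; repeating the backward step and accumulating is exactly the gradient loop in the assignment.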