Dear @donghyuk,
Welcome to the Discourse Community! Thanks a lot for posting your question here. I am a Mentor and I will do my best to answer your questions.
In backpropagation for convolutional neural networks (CNNs), it is common to perform convolution with 180-degree flipped weights. This is done by flipping the filter or kernel vertically and horizontally. The purpose of this flipping is to ensure that the gradients are computed correctly during the backpropagation process.
When performing backpropagation in a CNN, the gradients are propagated backwards through the network to update the weights; they are computed by taking the derivative of the loss function with respect to the weights, and the flipping described above is part of how those gradients are computed.
Regarding the specific code snippet you provided, it relates to padding. Padding is often used in CNNs to preserve the spatial dimensions of the input and output feature maps. In the code, da_prev_pad is the gradient of the loss with respect to the previous layer’s padded activations, and dA_prev is the same gradient without the padding. The snippet assigns the values of da_prev_pad to dA_prev after slicing off the padded border, so that the gradients propagated to the previous layer do not include the padded region.
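For concreteness, here is a minimal sketch of what that un-padding step typically looks like in NumPy. The names da_prev_pad, dA_prev and pad follow your snippet, but the shapes below are just illustrative and the exact indexing in the assignment may differ:

```python
import numpy as np

# Illustrative shapes only: a 7 x 7 activation with 3 channels and pad = 2
pad = 2
da_prev_pad = np.random.randn(7 + 2 * pad, 7 + 2 * pad, 3)  # gradient w.r.t. the *padded* activation

# Strip the padded border so only the gradients for the real pixels survive;
# nothing upstream of the padding exists to receive those gradient values.
dA_prev = da_prev_pad[pad:-pad, pad:-pad, :]

print(da_prev_pad.shape)  # (11, 11, 3)
print(dA_prev.shape)      # (7, 7, 3)
```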
I hope I was able to help you. Please feel free to ask a followup question if my reply was not clear to you.
Regards,
Can Koz
I think it is valid. What is the reason you would need to “flip” the gradients? The filters are applied “as is” in a straightforward manner. The only difference between convolutions and fully connected nets is that the filters get applied at multiple points, so the gradients here are the sum of the gradients from each of the points in the output space, as well as the sum (well, average) over the samples. Can you point us to a reference that explains why “flipping” would be required in a case like this?
That is the point that Can explained in the earlier response: the area that is getting dropped is the area corresponding to the added padding. Those parts literally didn’t exist until they got “magically” added, so there is no point in propagating gradients to them since there are no preceding values to propagate backwards to. In other words, the padding areas are a “dead end” when you’re going backwards. As with everything here, the forward and backward processes are mirror images of each other. On the forward pass, those values are just magically created from nowhere as 0 and on the backward pass they are a dead end, since there’s nothing to propagate the gradients to.
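To make the “mirror image” point concrete, here is a small sketch (not the assignment code) showing that the backward un-padding slice exactly undoes the forward np.pad:

```python
import numpy as np

pad = 2
A_prev = np.arange(16).reshape(4, 4)

# Forward: the border is created out of nothing, as zeros.
A_prev_pad = np.pad(A_prev, pad, mode="constant", constant_values=0)

# Backward: the border is a dead end, so it is simply sliced away.
recovered = A_prev_pad[pad:-pad, pad:-pad]

assert np.array_equal(recovered, A_prev)  # pad and un-pad are exact mirror images
```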
I understand now. When I drew out the matrices, I realized that I had been confusing convolution with element-wise multiplication.
Convolution with a 180-degree-flipped filter is actually the same operation as the element-wise multiply-and-sum described in this instruction.
I also realized that the padding is set up so that no information from it is carried forward, since padding = 2, filter size = 2, and stride = 2.
That is why those elements are ‘deleted’ in backpropagation.
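To double-check that, here is a quick sketch (assuming SciPy is available; the array sizes are just made up) showing that convolution with a 180-degree-flipped filter matches the sliding element-wise multiply-and-sum (cross-correlation):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

a = np.random.randn(5, 5)
w = np.random.randn(3, 3)

# Cross-correlation: slide the filter as-is and take the element-wise multiply-and-sum.
cross_corr = correlate2d(a, w, mode="valid")

# Mathematical convolution with the filter flipped 180 degrees gives the same result.
conv_flipped = convolve2d(a, np.flip(w), mode="valid")

assert np.allclose(cross_corr, conv_flipped)
```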
(That padding setting still seems a little odd to me, though. Is it actually a practical method, or just a theoretical example for learners?)
Padding is a “real thing”. It is used in cases in which you don’t want the h x w sizes to decrease too quickly as you go through the convolution layers. It also allows the filters to get more value out of the real values that are at the edges of the input. Without padding, each edge pixel is only included in one output value in one of the dimensions. It may seem counterintuitive at first, but it is really used. You’ll see examples as we go through this course of when it is used, so “hold that thought” and keep an eye on the types of networks we learn about.
Also it’s been a while since I watched the lectures here, but I’m sure Prof Ng discusses padding in the week 1 material.
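If it helps to see the effect on the dimensions, the standard output-size formula is floor((n + 2*pad - f) / stride) + 1; here is a small sketch with illustrative numbers:

```python
def conv_output_size(n, f, pad, stride):
    """Spatial output size of a convolution layer."""
    return (n + 2 * pad - f) // stride + 1

n, f, stride = 32, 3, 1
print(conv_output_size(n, f, pad=0, stride=stride))  # 30 -> shrinks by 2 at every layer
print(conv_output_size(n, f, pad=1, stride=stride))  # 32 -> "same" padding keeps h x w constant
```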
@paulinpaloalto
I didn’t mean the general usage of padding, but the padding settings in the forward step of the W1 conv_backward exercise.
With pad = 2, filter size = 2, and stride = 2, it looks like a “no padding is required” setting: at the first and last filter positions only the ‘padding region’ is covered, and after that only the ‘real region’ is covered.
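To show what I mean, here is a small 1-D sketch with a made-up input width of 4 (the real input size in the exercise may differ), enumerating the filter positions for pad = 2, filter size = 2, stride = 2:

```python
n, f, pad, stride = 4, 2, 2, 2                   # n = 4 is a made-up input width
padded = ["P"] * pad + ["R"] * n + ["P"] * pad   # P = padding, R = real pixel

for start in range(0, len(padded) - f + 1, stride):
    print(start, padded[start:start + f])
# 0 ['P', 'P']   <- first window: padding only
# 2 ['R', 'R']
# 4 ['R', 'R']
# 6 ['P', 'P']   <- last window: padding only
```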