I am trying to implement the CNN visualization from the course, but I have a question to ask.
The slide is as follows:
Why do the images on the right-hand side look so special? I thought each picture should look like:
Or does the slide actually mean the picture after filtering with kernel 1?
I hope to get a full explanation so I can implement it, thank you.
Welcome to the community. I don’t exactly understand which image you are referring to by “right-hand side”. Are you referring to the 4096-dimensional representation, or the 110 x 110 x 96 dimensional representation?
The right-hand side picture is the one under the label 6x6x256.
I do not really understand what the 9 images mentioned on the slide are.
More precisely, I want to know how to get the picture under the label 6x6x256.
I guess Prof Andrew has described it quite clearly in the video entitled “What are deep ConvNets learning?”. In order to obtain these 9 images (note that it’s not a single image), we simply select any unit in layer 1.
Our purpose is to find the patches of images that maximize the activations of this chosen neuron/unit.
Now, there are 2 ways to do this. Since any unit in layer 1 only gets to see a small part of the entire input image, we can grab that part from all the images, pass just the grabbed patches through the chosen neuron, and select the 9 patches corresponding to the 9 largest activations of the chosen unit. The other way is to pass the entire images through the entire Conv layer, and then select the 9 images corresponding to the 9 largest activations of the chosen unit. Once we have acquired those images, we can simply grab our desired patch from each of them.
Both methods will give you the 9 patches corresponding to the largest activations of the chosen neuron/unit, although the first is more computationally efficient. Feel free to review the lecture video once again, and if you have any query, we would be happy to help.
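In case it helps, here is a minimal NumPy sketch of the first method. All names, shapes, the filter, and the unit position are my own assumptions for illustration, not the actual course or paper code:

```python
import numpy as np

# Sketch: for a chosen layer-1 unit at spatial position (r, c), grab that
# unit's receptive-field patch from every image, run only the chosen filter
# over the patches, and keep the 9 patches with the largest activations.
rng = np.random.default_rng(0)
images = rng.standard_normal((100, 224, 224, 3))   # assumed dataset: 100 RGB images
kernel = rng.standard_normal((7, 7, 3))            # assumed 7x7 filter of the chosen unit
stride = 2
r, c = 10, 10                                      # assumed spatial position of the unit

# Receptive field of that unit in each input image
top, left = r * stride, c * stride
patches = images[:, top:top + 7, left:left + 7, :]  # shape (100, 7, 7, 3)

# Conv2D + ReLU for just this one unit: one activation value per image
activations = np.maximum(0.0, np.einsum("nhwc,hwc->n", patches, kernel))

# Indices of the 9 images whose patches activate the unit most strongly
top9 = np.argsort(activations)[-9:][::-1]
top9_patches = patches[top9]                        # the 9 patches to display
print(top9_patches.shape)                           # (9, 7, 7, 3)
```

This is cheaper than the second method because only the small patches, not the full images, are pushed through the filter.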
P.S. - Since this topic belongs to DLS C4, I have moved it to its appropriate place.
Thanks a lot. Am I right that the 9 patches (in layer 1) you mention are selected after the Conv2D computation but before the activation function is applied to them?
When we select the patches, we select them by the order of their respective activations only, i.e., after Conv2D + activation function (say ReLU): we select the patches with the largest activation values (corresponding to the chosen neuron). I hope this helps.
@Element covers Andrew’s lecture quite well. Now, I’m trying to clarify something from the paper’s side, since Andrew’s lecture is about “intuition”.
First of all, I need to clarify the mapping between Andrew’s intuition diagram and the actual model in the paper. The upper diagram is from the paper.
Sometimes a size reduction is done by max pooling, not just a convolution. In the first layer, the authors applied Conv2D followed by a ReLU activation and MaxPool2D, so the output of the first layer is 55x55x96.
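The layer-1 size arithmetic can be checked with the standard output-size formula. The padding values below are my assumptions chosen so the arithmetic matches the sizes mentioned in the thread (the paper uses overlapping pooling, which has the same effect):

```python
# Standard conv/pool output-size formula: floor((n + 2p - f) / s) + 1
def out_size(n, f, stride, pad=0):
    return (n + 2 * pad - f) // stride + 1

# ZFNet layer 1 on a 224x224 input (padding is an assumption, see above)
conv1 = out_size(224, f=7, stride=2, pad=1)   # 7x7 conv, stride 2  -> 110
pool1 = out_size(conv1, f=3, stride=2, pad=1)  # 3x3 max pool, stride 2 -> 55
print(conv1, pool1)                            # 110 55
```

With 96 filters in that layer, this gives the 110x110x96 pre-pool and 55x55x96 post-pool shapes from the slide.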
One of the authors’ key innovations is to visualize the output (feature maps) of each layer while clarifying which part of the input image is focused on. This small focused area is an ‘image patch’.
To revert a feature map into an image patch, they developed a ‘deconvolution’ mechanism. It is the reverse of the forward data processing, but includes some tricks. Examples are:
- The max pooling operation is non-invertible, so they added new variables, called switches, that record the pooling locations during the forward pass. (These are used for “UNPOOL (Unpooling)” in the chart.)
- They use transposed versions of the same filters for the deconvolutions (“FILTER (Filtering)” in the chart).
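To make the switch idea concrete, here is a toy NumPy sketch of 2x2 max pooling that records switches, plus the matching unpooling. The function names and the tiny input are my own illustration, not the paper's code:

```python
import numpy as np

def maxpool_with_switches(x):
    """2x2, stride-2 max pool that also returns the argmax 'switches'."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    switches = np.zeros((h // 2, w // 2, 2), dtype=int)
    for i in range(h // 2):
        for j in range(w // 2):
            block = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            k = np.argmax(block)                      # flat index of the max
            pooled[i, j] = block.flat[k]
            switches[i, j] = (2 * i + k // 2, 2 * j + k % 2)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Reverse of the pool: place each value back at its recorded position."""
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 1.],
              [0., 0., 5., 6.],
              [1., 2., 7., 8.]])
p, s = maxpool_with_switches(x)
print(p)   # [[4. 2.]
           #  [2. 8.]]
```

Unpooling `p` with `s` restores each max to its original location and fills everything else with zeros, which is exactly what makes the reversed pass possible despite pooling being non-invertible.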
The basic deconvolution flow is:
- The first-layer output goes through a deconvolution layer. That part is easy to understand.
- The 2nd-layer output goes through the deconvolution layer for the 2nd layer, and its output is then forwarded into the deconvolution layer for the first layer. (The same applies to the higher layers.)
So you can easily see that the output in a higher layer covers a broader area of the input image, i.e., it is a larger image patch that includes more details. (This can also be seen from the coverage of a filter relative to the image: a 7x7 filter for a 224x224 image in the first convolution, but a 3x3 filter for a 13x13 image in the 4th layer.)
The paper claims that, with this visualization, they found the feature maps from layer 1 were not clear enough. So they changed the Krizhevsky model (the best model at that time) and got better performance. What they changed were the filter size (11x11 → 7x7 for the first layer) and the stride (4 → 2).
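A quick way to see the effect of that change is to compare the first-layer output sizes. The arithmetic below assumes a 224x224 input and no padding, just to show the trend:

```python
# Conv output-size formula with no padding: floor((n - f) / s) + 1
def out_size(n, f, stride):
    return (n - f) // stride + 1

old = out_size(224, 11, 4)   # Krizhevsky: 11x11 filter, stride 4
new = out_size(224, 7, 2)    # changed model: 7x7 filter, stride 2
print(old, new)              # 54 109
```

The smaller filter and stride sample the input roughly twice as densely, which is why the layer-1 feature maps come out cleaner.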
I hope this can be a supplement to Andrew’s talk.