Image patches, hidden conv "units" and more

I’ve found what appear to be four omissions, inconsistencies or errors on slide 2 of the C4W4 “What are deep ConvNets learning?” lecture video (Neural Style Transfer section). However, it’s rare to have four such issues on one slide, so I very much welcome a sanity check. Can anyone help? Thanks!

“Units” of the ConvNet layer 1 are referred to a few times here on slide 2 and it is essential to understand what they are to comprehend this slide. However, I do not recall them being defined in any of the DLS courses. A large Stack Exchange thread quoting this course has emerged (neural networks - definition of "hidden unit" in a ConvNet - Cross Validated). I believe the most concise definition is that a “unit” of layer 1 of this ConvNet is a single filter of dimension 115x115x3, of which there are 96 in layer 1 here. Is this correct?

Number of patches visualized. Ng chooses to plot exactly nine image patches, but it is not explained where nine comes from and numbers matter a lot in this course, especially when they are visualized. Is nine an arbitrary choice? If not, why not?

“Seeing” the network. Ng says that “a hidden unit in layer 1, will see only a relatively small portion of the neural network.” But a hidden unit in layer 1 doesn’t see any of the network. It sees input images on which it operates, one position at a time. Correct?

Image patch size. Ng also says “And so if you visualize, if you plot what activated unit’s activation, it makes makes sense to plot just a small image patches, because all of the image that that particular unit sees.” This makes intuitive sense. However, in this particular example, backsolving for the size of the layer 1 filters suggests that each of the 96 filters is 115x115x3 which is over half the input image size. This is not “small.” Further, each of the nine example patches shown seems to be of significantly lower resolution than 115x115. I’m guessing there is an error here where the layer 1 filters are intended to be of small planar dimensions (say 17x17) but accidentally were made 115x115 which is over half the input image dimension. Is that right?

Number of units visualized. After showing 9 image patches for a single hidden unit, Ng repeats this exercise for an additional 8 units for a total of 9 units. However, there are 96 units, not 9, right? In this case, is the choice to visualize exactly 9 units also an arbitrary choice? If not, why not? Further is the choice of the number 9 for both image patches and units (patches and units are different) a coincidence? (Assuming so but want to sanity check.)

Hey @am003e,
Allow me to answer your question in parts, as you have presented. For most of the answers, I will be referring to the slide entitled “Visualizing what a deep network is learning”.


No, this would be incorrect. A “unit” or “hidden unit” refers to a single element in any of the layers. For instance, if we consider the layer with dimensionality 110 x 110 x 96, we will have 110 * 110 * 96 = 1161600 units.


Here, I believe there is a slight discrepancy. I believe that the first filter in this network is of dimensionality 3 x 3 x 3. So, in this case, f = 3, and I am assuming stride s = 2 and padding p = 0, since this is the closest possible configuration that would give us a final output height and width of 110. However, if we compute the output dimensionality using this configuration, I believe it should be 111 x 111 x 96 (assuming 96 filters are used). I will raise an issue regarding this. Nonetheless, I believe that the convolutional filter used here is of 3 x 3 dimensionality, and hence, a single input unit will be inspired by 9 image patches. So, no, 9 is not an arbitrary choice, and I believe you have understood the reason why.
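To make the arithmetic above easy to check, here is a minimal sketch of the output-size formula from the course, floor((n + 2p - f) / s) + 1. The function name `conv_output_size` is just an illustrative helper, not something from the lecture.

```python
def conv_output_size(n: int, f: int, s: int = 1, p: int = 0) -> int:
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


# Configuration assumed in this reply: 224 x 224 input, f = 3, s = 2, p = 0.
print(conv_output_size(224, f=3, s=2, p=0))  # 111, not the 110 shown on the slide
```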


By a “relatively small portion of the neural network”, Prof meant only the first layer itself, or in other words, just the input image, as you stated. As you keep going into deeper layers, they will keep seeing larger portions of the neural network, and hence, larger image patches will be visualized for the deeper layers.


I guess the answers to the remaining questions already lie in the above answers, so I will be leaving those. Feel free to let us know if you face any other queries.

Cheers,
Elemento

Thanks Elemento.

Regarding units, thank you for the concise clarification. This makes sense and agrees with my understanding of FC units; in both cases the rule is “one activation value corresponds to one unit.” In this case the activation is 110x110x96 so that’s the number of units - thanks. (Request: since CNN “units” haven’t been explicitly defined, maybe say so. It’ll help a lot to understand that the 9 images shown are for one of literally over a million units.)

Regarding “seeing” the network, I appreciate your sanity check that my understanding is correct. Since Prof Ng is discussing the first layer when he says this, I believe a correction of the lecture text / notes is warranted.

Regarding image patch size, my math disagrees with yours; however, I believe you’ve hinted at a solution which makes sense. In my OP, when I calculated 96 filters of size 115x115x3, I was assuming no zero padding and guessed a stride of 1. What seems more likely is that the stride must have been 2 (which unfortunately was not mentioned in the lecture), in which case the output dimension, according to the formula, is floor((n+2p-f)/s+1), in this case 110 = floor((224+2p-f)/2+1). Solving for f and assuming no zero padding, we find f=6 which is much more in line with Ng’s lecture comments about small patches. So we have 96 filters of dimension 6x6x3 with stride of 2. (If my math is right, it is impossible to obtain an output of size 110x110x96 with any stride other than 1 or 2, unless a massive amount of zero padding were applied - which we haven’t seen in this course.) My math seems to disagree with yours since my filters come out to 6x6x3 each not 3x3x3 each. Either way, I’d request an update to the lecture text / notes to mention that the stride is 2 and the filters are 6x6x3 since the rest of this slide is very difficult to understand otherwise.
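A quick way to sanity check this is to brute-force the filter sizes that give a 110-wide output for a 224-wide input. This is only a sketch under the no-padding assumption stated above, and `filter_sizes_for` is a made-up helper name. Note that because of the floor in the formula, f = 5 also satisfies the equation at stride 2, alongside f = 6.

```python
def conv_output_size(n, f, s=1, p=0):
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


def filter_sizes_for(n, out, s, p=0):
    """All filter sizes f (1..n+2p) that map input size n to output size out."""
    return [f for f in range(1, n + 2 * p + 1)
            if conv_output_size(n, f, s, p) == out]


# Assumption: no zero padding (p = 0), as in the post above.
for stride in (1, 2, 3):
    print(stride, filter_sizes_for(224, 110, stride))
# stride 1 -> [115], stride 2 -> [5, 6], stride 3 -> [] (nothing works without padding)
```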

Regarding number of units visualized, I believe your comment is addressing image size. I now understand that if the filters are 6x6x3 then a single filter output (one unit) of the first layer “sees” input of size 6x6x(RGB) in which case the patches shown should be 6x6xRGB. But I was actually asking about the fact that we are shown 9 images as opposed to 5 or 10 or any other number. After staring at this more, I think I’ve convinced myself that the number 9 is truly arbitrary and it is a coincidence that the number of units visualized at each layer is also 9. There are literally over 1 million units to show for layer 1, and over 500k for layer 2, so Prof Ng has to choose a number of units to show and a number of image patches for each, and he just happens to pick 9 both times. Request if the lectures are updateable: say that these numbers are arbitrary.

Thanks again for the help!

Hey @am003e,
Let me once again reply part by part.


First of all, apologies for this one. I misunderstood this actually. In the diagram, the dimensionalities that are given are for the inputs and outputs of the various layers, and not for the network’s layers themselves. So, assuming that the first layer is a convolutional layer with 96 filters, each having a dimensionality of 3 x 3 x 3, we would have 96 * 3 * 3 * 3 = 2592 units or hidden-units in the first layer. Now, as far as I recall, “hidden units” is a self-explanatory and a common phrase, and doesn’t need any further understanding, so I would be concluding this one here.


Regarding this, the fact that we can treat the image as a part of the neural network (i.e., the input of the neural network) is subjective, and once again, I believe, using the phrase “small portion of the neural network” is fairly trivial to understand. I will leave it as of now. If a learner raises a similar concern in the future, I will create an issue for this one.


Thanks for your take on this. So, the configuration that gives the output dimensionality as 110 x 110 x 96 is f = 6, s = 2, p = 0. However, I still believe that f = 3, since Prof Andrew displays 3 * 3 = 9 RGB patches for any of the hidden units. Let me confirm this with the team if the output dimensionality is wrong or if the number of patches is wrong. Either way, some correction would be needed here. I will get back to you on this one.


Once again apologies, I misunderstood your query. Yes, you are correct, the #different hidden-units visualized = 9 for any of the layers, and this is an arbitrary choice. I believe the reason that this was chosen is for better visualizations. Similarly, #patches visualized for 1 hidden-unit = 9 is again an arbitrary choice. The choice which is not an arbitrary one is the size of each of the patches visualized. For all the hidden units in the first layer, the size of each of the patches would be 3 x 3 x 3. For all the hidden units in the second layer, the size of each of the patches would be 27 x 27 x 3 (assuming 3 x 3 convolutional filters in the first and second layers). As you can see in the next slide, irrespective of the size of the patches, they are shrunk to occupy the same space for visualization purposes.
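For reference, the patch a unit “sees” in the input is its receptive field, which can be computed with the usual recurrence over (filter size, stride) pairs. The sketch below is an illustration only: the strides are assumptions, since the thread does not state them, and `receptive_field` is a hypothetical helper. Under these assumptions two stacked 3 x 3 layers give 5 or 7 input pixels per side, so a 27 x 27 patch would need more layers or larger strides than assumed here.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, per side) of one unit.

    layers: list of (filter_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for f, s in layers:
        rf += (f - 1) * jump
        jump *= s
    return rf


print(receptive_field([(3, 1)]))          # 3: one 3 x 3 layer
print(receptive_field([(3, 1), (3, 1)]))  # 5: two 3 x 3 layers, stride 1 (assumed)
print(receptive_field([(3, 2), (3, 1)]))  # 7: first layer with stride 2 (assumed)
```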

I hope this helps. Let us know if you have any further queries.

Cheers,
Elemento

Hey @am003e,
The intriguing thing here is that every time I think about this, I come up with a new dilemma. Let me discuss this issue with the team, so that we can figure out what the discrepancies are and what assumptions are made in this lecture video. For now, I would suggest that you move on to the lecture videos ahead, and I will get back to you on this one as soon as we are able to figure this out.

Cheers,
Elemento

“I misunderstood this actually. In the diagram, the dimensionalities that are given are for the inputs and outputs of the various layers, and not for the network’s layers themselves. So, assuming that the first layer is a convolutional layer with 96 filters, each having a dimensionality of 3 x 3 x 3, we would have 96 * 3 * 3 * 3 = 2592 units or hidden-units in the first layer.”

Thanks for mentioning this; I think this highlights why it would be good for the lecture to define a CNN unit. If I understand you correctly here, one unit corresponds to one filter coefficient (of which there are 96x6x6x3 by my math or 96x3x3x3 by your math), not to input or output activations (for which there are 110x110x96 output activations for this layer). If this is the case then CNN units differ substantially in definition from FC units: for CNN, one unit maps to one coefficient but for FC units one unit maps to many input coefficients and one activation.
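To put numbers on the two readings of “unit” discussed so far, here is a small sketch that counts layer-1 units under each. The 3x3 and 6x6 filter sizes are the two guesses from this thread, not confirmed values, and the function names are only illustrative.

```python
def count_activations(height, width, channels):
    """Units if one unit = one output activation value."""
    return height * width * channels


def count_filter_coefficients(num_filters, f, in_channels=3):
    """Units if one unit = one filter coefficient (weights only, no biases)."""
    return num_filters * f * f * in_channels


print(count_activations(110, 110, 96))   # 1161600 output activations
print(count_filter_coefficients(96, 3))  # 2592 coefficients with 3x3x3 filters
print(count_filter_coefficients(96, 6))  # 10368 coefficients with 6x6x3 filters
```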

“The choice which is not an arbitrary one is the size of each of the patches visualized. For all the hidden units in the first layer, the size of each of the patches would be 3 x 3 x 3.”

This doesn’t seem to agree with the layer 1 images themselves which visually (what I see when looking at them carefully on the slide) are higher resolution than 3x3xRGB. I think they might possibly be 6x6xRGB but realistically I’d guess they are more like 20x20xRGB. Of course, the image patches are not the filter coefficients but rather inputs which cause the filters to output the highest values, so we could show an arbitrary vicinity around the exact pixels that “rang” the filter just right.

Thanks Elemento. I was able to complete the rest of the course before I posted the issues, but I’d still like to understand these aspects better, and would like to save future students some head scratching. I appreciate your willingness to sort this out!

Hi all!

Thanks for all the comments and I appreciate the nice debate. Sorry for the late reply.

Regarding the confusions, I think all of them can be explained by careful reading of the paper that is cited on the slide:
Visualizing and Understanding Convolutional Networks (arxiv.org)
However, I agree that this might not be clear to the learner when just watching the video. Thus either the learner needs to read the paper to fully understand the video, or we can add some clarifying information to the video. I will look into the latter option.

Let me first clarify what you see in the video:

  1. The numbers shown on the slide (224x224x3 for the image, 110x110x96 for the first layer) are indeed the dimensions of the input/output of each layer. So the input to the NN is a 224x224x3 image and the output of the first layer is 110x110x96. How do we get to this output? 96 is the number of filters and 110x110 is the size of the feature map produced when you pass a filter over the image. The filter sizes are described in the paper: 7x7 with stride 2 and then maxpool 3x3 with stride 2.

  2. The “units” in the video refer to feature maps (unit = feature map). In the paper they are called feature maps. Not the filters, nor the individual numbers in the filters, but the full feature maps. Perhaps “unit” is not the best choice of word, but maybe (unknown to me) it is used this way in this context. I need to research that.

  3. The number 9 is indeed an arbitrary number, as you suggested. What they do is look at various feature maps and at what most strongly activates them. For visualization purposes, they chose 9 different feature maps to show this, and then for each of those 9 feature maps they chose 9 patterns that highly excite it.

I imagine this as having an image with just perfect horizontal stripes and just two filters: a horizontal stripe filter and a vertical stripe filter. The horizontal stripe filter will lead to strong activations in its feature map, while the vertical stripe filter will show nothing.

With a larger number of filters (96 in the first layer) you can’t expect to show all of them, because it would be just a mess, so you show the most obvious ones, and thus 9 is arbitrary. They could show 3 or 5 or 15. I assume they used 9 because it is 3x3 and you can make a nice square image in your research paper. The same goes for the number of example patterns for each of the filters: you just show 9 examples that give the strongest activations for each of the 9 filters you chose to show.

So to summarize the paper: they look at feature maps and at which images excite them the most (and choose 9 filters, and for each of them the 9 most exciting patterns, to visualize).
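As a rough illustration of that idea (not the paper's actual procedure, which I have simplified here), the sketch below takes the activations of one feature map over several images, picks the k strongest, and maps each back to the input patch that produced it. It uses the 7x7 filter and stride 2 mentioned above and assumes no padding; `top_k_patches` and the toy data are made up for the example.

```python
import heapq


def top_k_patches(feature_maps, k=9, f=7, s=2):
    """Find the k strongest activations of ONE feature map across images.

    feature_maps: list (one entry per image) of 2-D lists of activations.
    Returns (image_index, top, left, size, activation) tuples, where
    (top, left) is the corner of the f x f input patch, assuming no padding.
    """
    candidates = (
        (value, img, i, j)
        for img, fmap in enumerate(feature_maps)
        for i, row in enumerate(fmap)
        for j, value in enumerate(row)
    )
    strongest = heapq.nlargest(k, candidates)
    return [(img, i * s, j * s, f, value) for value, img, i, j in strongest]


# Toy stand-in: two images, each with a 2 x 2 feature map.
toy_maps = [
    [[0.1, 0.9], [0.3, 0.2]],
    [[0.5, 0.0], [0.8, 0.4]],
]
for patch in top_k_patches(toy_maps, k=2):
    print(patch)
```

With real data you would crop `image[top:top + size, left:left + size]` for each result and tile the 9 crops into a 3x3 grid, which is where the square layout comes from.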

We will discuss it and add some clarification to the video. Let me know if this is sufficient, or if you believe my understanding is incorrect.

Best, Jan