Understanding "what conv nets are really learning"

Hi. I’d like some clarification regarding what is shown in the “what conv nets are really learning” video. Specifically:

  1. Do the 9 “maximum” activations relate to the whole training dataset? In other words, does each of the 9 patches presented belong to a different input image?
  2. When showing the outputs of a neuron in deeper layers, how come they look like portions of the input image? After all, each “deep” neuron receives data from all of the original image pixels, so I would not expect a “sensible” visualization of such a neuron’s output.

Hello @gilad.danini,

Check out the video from 0:29, or the following part of the transcript:

Here’s what you can do. Let’s start with a hidden unit in layer 1. And suppose you scan through your training set and find out what are the images, or what are the image patches, that maximize that unit’s activation. So in other words, pass your training set through your neural network, and figure out what is the image that maximizes that particular unit’s activation.

In short, those are images, and not some learnt representation of the images.

Cheers,
Raymond

Hi, I was searching for “image patches that maximize that unit’s activation” and found this thread. Could you please clarify what is meant by ‘image patches’ and why there is an interest in ‘maximizing that unit’s activation’? Thank you!

Hello, @Zijun_Liu,

Please, next time, share the lecture’s name and the relevant time mark. Others would rather get this information from you than have to infer it themselves, because that takes time and guesswork.

Now I am assuming you were referring to the Course 4 Week 4 lecture “What are deep ConvNets learning”.

As for why we “maximize the unit’s activation”: I recommend watching the two Course 4 Week 1 lectures, “edge detection example” and “more edge detection”. The idea is that when we pass a vertical line through a vertical line detector (a unit, or filter, in NN terminology), we get high activation values; but what happens if we feed it a horizontal line? You can probably guess, but the easiest way to verify it yourself is to repeat the lecture’s vertical line detector exercise with a horizontal line as the input :wink: .
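
Here is a minimal NumPy sketch of that exercise (the 6x6 image and the 3x3 vertical filter are the ones from the Week 1 lecture; the `conv2d` loop below is just a plain “valid” cross-correlation written out for illustration, not any course helper):

```python
import numpy as np

def conv2d(image, kernel):
    """Plain "valid" cross-correlation, as in the Course 4 Week 1 exercise."""
    h, w = kernel.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * kernel)
    return out

# The vertical edge detector from the "edge detection example" lecture.
vertical_filter = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]])

# A vertical edge: bright left half, dark right half (the lecture's 6x6 example).
vertical_edge = np.array([[10, 10, 10, 0, 0, 0]] * 6)

# A horizontal edge: bright top half, dark bottom half.
horizontal_edge = np.array([[10] * 6] * 3 + [[0] * 6] * 3)

print(conv2d(vertical_edge, vertical_filter))    # 30s along the edge: high activation
print(conv2d(horizontal_edge, vertical_filter))  # all zeros: the unit stays quiet
```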

Image patches are exemplified on the slides shown from 5:02 to the end of the lecture. They can refer to whole images or, as a quick google will confirm, to smaller rectangular regions of images. For example, let’s say our input image is 640 x 640 pixels, and layer 1 has 8x8 filters (units). The thing is, a layer 1 unit can only cover 64 pixels of the image at a time, which means that, if a unit is really maximized, that is due to just 64 pixels of the whole 640 x 640 image. Those 64 pixels are an image patch. We can then think about why layer 2 covers a larger region than layer 1, as can be seen from the deeper layers’ image patches in the “What are deep ConvNets learning” lecture.
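
To make the “image patch” idea concrete, here is a small sketch with made-up numbers matching the example above (a random image and a random filter, just to show the mechanics; a real visualization would scan every training image and keep the top 9 patches per unit):

```python
import numpy as np

# Hypothetical numbers from the example above: a 640 x 640 (grayscale, for
# simplicity) image and a single 8x8 layer-1 filter applied with stride 1.
rng = np.random.default_rng(0)
image = rng.random((640, 640))
filter_1 = rng.random((8, 8))      # stands in for a learned layer-1 filter

k = 8
act = np.zeros((640 - k + 1, 640 - k + 1))
for i in range(act.shape[0]):
    for j in range(act.shape[1]):
        act[i, j] = np.sum(image[i:i+k, j:j+k] * filter_1)

# The unit's maximum activation is caused by just one 8x8 region of the image:
r, c = np.unravel_index(np.argmax(act), act.shape)
image_patch = image[r:r+k, c:c+k]  # this 8x8 slice is the "image patch"
print(r, c, image_patch.shape)

# Why deeper layers' patches are larger: stacking conv layers widens the
# receptive field. Two stacked 8x8, stride-1 layers already "see" a
# 15x15 region of the original image (8 + (8 - 1)).
```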

Cheers,
Raymond

PS: You probably would want to read the paper, too, for all the details.


The point is trying to “see” what the network has actually learned to detect. The section of an input image that triggers the largest activation value from a given neuron in a given layer gives you a picture of what that neuron has learned to recognize. That is the interesting thing about this work: instrumenting the trained network so that we can visualize how it is working in the internal hidden layers.
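
For anyone who wants to try this, here is a rough sketch of that instrumentation in TensorFlow/Keras. VGG16 is used purely as a convenient pretrained stand-in (the paper uses its own network), and the layer and channel choices are arbitrary:

```python
import numpy as np
import tensorflow as tf

# "Instrument" a trained network so we can read a hidden layer's activations.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False)
layer_name = "block3_conv1"            # assumption: any mid-level conv layer works
probe = tf.keras.Model(inputs=base.input,
                       outputs=base.get_layer(layer_name).output)

def max_activation(images, channel):
    """Highest activation of one unit (channel) over a batch of images."""
    acts = probe(images)               # shape: (batch, h, w, channels)
    return tf.reduce_max(acts[..., channel], axis=(1, 2)).numpy()

# images: an (N, 224, 224, 3) batch of preprocessed training images (not shown).
# scores = max_activation(images, channel=17)
# top9 = np.argsort(scores)[-9:]       # the 9 images that excite this unit most
```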


I don’t know if anyone has done this, but the thought brings me back to an obscure, private note I sent you a long time ago.

Or, at each data point, or node, in the network-- in my mind, ‘if you can hash it, you can track it’.

Just acting on huge sets of numbers, as we are now, does not tell you much about the response, or give one a sense of metrics.

Yes, there would be a good deal of overhead, because now every data point passing through the network has to be an object–

Or perhaps there is an even more clever idea than mine; I imagine… some way you could both hash and preserve the original value at the same time?

[remember the value is only a ‘symbol’, some degree or ‘order of magnitude’ – as long as that is preserved, who cares what goes along with it]

I don’t know…

I think of it this way:

  1. What a ConvNet, like any supervised machine learning algorithm, literally learns is a set of weights that minimizes the difference between a prediction and the known ground truth.

  2. What a deep ConvNet does at each layer is use these weights and the activation function to downsample its input, condensing information as it goes. In a simple classification example, the entire set of 640x640x3 values (over 1.2 million) is concentrated into information that can be represented by a single binary value (see the sketch after this list).
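
As a toy illustration of that condensation (a made-up architecture, not the course’s or the paper’s), a model summary shows the shapes shrinking layer by layer from 640x640x3 down to a single sigmoid output:

```python
import tensorflow as tf

# A minimal sketch of how ~1.2M input values get condensed, layer by layer,
# into a single binary prediction.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(640, 640, 3)),                      # 1,228,800 values in
    tf.keras.layers.Conv2D(8, 3, strides=2, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),           # 1 value out
])
model.summary()   # watch the output shape shrink at every layer
```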

The learned filter weights at each layer act as a signal processing chain. You can think of the activation-maximizing image regions as those containing the most signal for a given filter.

If we were truly visualizing what the ConvNet learns, it would be a bunch of floating point value matrices. Literally true, but not very useful for building intuition. Instead, we visualize the portion(s) of the input image(s) that those learned weights propagate through the chain. So what you see in the slides of this video are not the actual learned weight values, but the result of applying the trained filters to some input(s).*

*note that in the linked paper the authors discuss how they used a deconvolutional network to reverse the downsampling and work backwards from a given layer to what the original pixels were.
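
To see how literally true the “floating point value matrices” point is, here is a quick peek at a trained layer’s raw weights (again with VGG16 standing in for a trained ConvNet):

```python
import tensorflow as tf

# The literal content of a trained conv layer is just an array of floats.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False)
w, b = base.get_layer("block1_conv1").get_weights()
print(w.shape)        # (3, 3, 3, 64): sixty-four 3x3x3 filters
print(w[:, :, 0, 0])  # one filter's red-channel weights: not much to "see" here
```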


Thanks, @ai_curious, for pointing out the deconvolution, as my previous response might give the impression that those image patches are parts of the original images. Instead, they are results of the deconvolution, which might differ from the originals (for example, max pooling is not reversible), but the essentials seem to be pretty well preserved. :raised_hands:
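
A tiny illustration of why max pooling by itself is not reversible (arbitrary numbers; the paper’s deconvnet works around this by recording “switches”, i.e. where each max came from):

```python
import numpy as np

# 2x2 max pooling keeps only the max of each window, so plain "unpooling"
# cannot recover the 12 discarded values or their positions.
x = np.array([[1., 9., 2., 3.],
              [4., 5., 6., 7.],
              [8., 0., 1., 2.],
              [3., 4., 5., 6.]])
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[9. 7.] [8. 6.]] -- the other 12 values are gone
```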
