Since a kernel only sees an image patch for each output neuron, how do I get the image patch for a maximally activated neuron in a layer, as Andrew showed in Course 4, Week 4, Lecture 2? Any guidance would be appreciated.
In the lecture, Prof Ng explains how it was done and also gives the reference to the original paper he is talking about. It is Zeiler and Fergus, Visualizing and Understanding Convolutional Networks. I have not read the paper, but I would expect that they describe their methodology.
Yes, going through the paper now.
Great! Let us know how that goes. I did a quick scan of it and it looks like they just describe things in words and that it may take a bit more digging and reading the other paper on “deconvolutional” networks. I assume that’s an old school name for what we now call “transpose convolutions”, but not sure. The paper is from 2013 so terminology may have evolved a bit in the intervening 10 years.
And I forget whether GitHub had been invented at that point in time. I was hoping for a link to a repo with some source code, but did not see that.
I haven’t found any GitHub repo either. There are some repos that refer to that paper, but none of them has a proper/actual implementation of the “image patch” extraction.
I read the paper. It’s an enjoyable read. They use max unpooling, ReLU, and transpose convolutions to reconstruct the input at each layer, propagating backward until the reconstruction reaches pixel space.
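For anyone following along, here is a minimal PyTorch sketch of one reconstruction step in that spirit (my own illustration of the unpool → ReLU → transpose-conv idea, not code from the paper; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Forward pass: conv -> ReLU -> maxpool.
# return_indices=True saves the max locations ("switches" in the paper).
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(2, stride=2, return_indices=True)

x = torch.randn(1, 3, 32, 32)
a = torch.relu(conv(x))
p, switches = pool(a)            # p: (1, 8, 16, 16)

# Deconvnet pass: unpool -> ReLU -> transpose conv back toward pixel space.
unpool = nn.MaxUnpool2d(2, stride=2)
deconv = nn.ConvTranspose2d(8, 3, kernel_size=3, padding=1)
deconv.weight.data = conv.weight.data  # reuse the forward filters

r = unpool(p, switches)          # values go back to their max locations
r = torch.relu(r)
recon = deconv(r)                # reconstruction in input space: (1, 3, 32, 32)
```

Repeating this per layer, from the layer of interest down to the input, is the basic mechanism the paper describes.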
I found a GitHub repo that uses VGG16 and ran the code: GitHub - huybery/VisualizingCNN: 🙈 A PyTorch implementation of the paper “Visualizing and Understanding Convolutional Networks” (ECCV 2014)
Here’s what I got for the final layer’s most activated neuron:
It doesn’t give me the image patch coordinates as a bounding box.
@paulinpaloalto Is it possible to exactly compute the coordinates of the image patch that is responsible for firing the maximally activated final layer neuron?
Is it even possible to get back (or reconstruct) the patch of the original image?
Interesting! Well, I don’t have any particular expertise here and have not really read the paper, just skimmed it. But my impression from the pictures they show and from what Prof Ng said in the lectures is that it should be possible to compute the input patch that triggers the maximum activation.

But note that conceptually if you take it all the way to the final answer, then it’s getting fed through some fully connected layers and you’re also at the level of “Cat” or “Not a Cat”, so the whole image is really what triggered that: it’s not one portion of the image, it’s the entire thing. All the internal state is integrated into the final “yes/no” answer.

What Prof Ng was discussing was about probing the internal conv layers so that you could get a sense for what was being learned at the various hidden layers as you proceed through the network (e.g. simple features like edges at the early layers and more complex structures at the later layers). But if you think about what the inputs are that produce the output of a single application of a filter at a hidden layer, it’s clear what it is in terms of the immediately preceding layer’s input, but then each of those elements in the input comes from multiple outputs of the preceding layer, rinse and repeat. Does the paper discuss how they do that “induction” through multiple preceding layers?
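That “induction” through preceding layers is actually mechanical: it’s the standard receptive-field arithmetic, driven only by each layer’s kernel size, stride, and padding. A quick sketch of my own (not from the paper) that computes how big the input patch for one neuron grows as you stack layers:

```python
def receptive_field(layers):
    """Compute the receptive field size, the jump (pixel distance between
    adjacent output neurons), and the center of the first output neuron,
    given (kernel, stride, padding) triples in forward order."""
    r, j, start = 1, 1, 0.5  # one pixel, unit jump, centered at pixel 0.5
    for k, s, p in layers:
        start += ((k - 1) / 2 - p) * j
        r += (k - 1) * j
        j *= s
    return r, j, start

# Example: two 3x3 stride-1 convs (pad 1), then a 2x2 stride-2 maxpool.
r, j, start = receptive_field([(3, 1, 1), (3, 1, 1), (2, 2, 0)])
print(r, j)  # receptive field is 6x6 input pixels, jump is 2
```

Each output neuron of that small stack sees a 6×6 patch of the input, and neighboring neurons’ patches are shifted by 2 pixels, which is exactly the kind of bookkeeping needed to recover a patch’s coordinates.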
Was there any documentation or explanation associated with the code in the VGG16 based repo that you found?
That’s exactly what I want to do. To get a better understanding of what the conv layers are learning. But I also want to get back the image patch.
If there’s a way to get back the image patch, as Prof Ng suggested, I have to find a way to do that.
The question is: where do I even ask for help with this? I am also not an expert in convnets; I just started learning recently.
The idea would be: for each output value in an output channel, save the corresponding start and end locations of the previous layer’s patches during forward propagation.
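An alternative to saving coordinates during forward prop is computing them after the fact from the layer geometry alone. A rough sketch (my own, assuming each layer is described by a kernel/stride/padding triple) that maps an output neuron’s position back to an input bounding box:

```python
def patch_bbox(i, j, layers, img_h, img_w):
    """Map output position (i, j) back to the input-image bounding box
    (y0, x0, y1, x1) it depends on, by inverting each layer's geometry
    (kernel k, stride s, padding p). Clips the box to the image."""
    y0, y1 = i, i
    x0, x1 = j, j
    for k, s, p in reversed(layers):
        y0 = y0 * s - p
        x0 = x0 * s - p
        y1 = y1 * s - p + (k - 1)
        x1 = x1 * s - p + (k - 1)
    return (max(y0, 0), max(x0, 0),
            min(y1, img_h - 1), min(x1, img_w - 1))

# Neuron (0, 0) after one 3x3 conv (stride 1, pad 1) sees rows/cols -1..1,
# clipped to 0..1 inside a 224x224 image:
print(patch_bbox(0, 0, [(3, 1, 1)], 224, 224))  # (0, 0, 1, 1)
```

Once you find the argmax position of the activation map for the neuron of interest, passing it through something like this would give the bounding box directly, without modifying the forward pass.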
The paper discusses it, but only at a high level. They describe the deconvnet’s three operations: approximate max unpooling (by saving the max locations from the forward pass), ReLU, and filtering (using the exact filters from the forward pass).
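Since the repo is PyTorch, it may help that those saved max locations map directly onto built-in ops: `MaxPool2d(return_indices=True)` returns the “switches”, and `MaxUnpool2d` uses them to place values back. A tiny worked example:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])
pooled, idx = pool(x)         # pooled is [[[[4.]]]]; idx records where 4 was
approx = unpool(pooled, idx)  # 4 returns to its original slot, rest are zero
print(approx)
# tensor([[[[0., 0.],
#           [0., 4.]]]])
```

That zero-everywhere-except-the-max structure is what makes the reconstruction an approximation: the non-max values are lost during pooling and cannot be recovered.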
There is no documentation. Just a reference to the paper.
Well what did you learn from reading the paper? It looked like they just talked in words about what they actually did. If that’s not enough to give you a starting point, then the other approach is to actually dig into the code of the implementation you found and try to understand what it’s doing and see if it has the mechanisms you want. The high level point is that this was “bleeding edge” original research as of 2013, but that doesn’t mean that it’s just a trivial thing in 2023: it may still require some “heavy lifting”.
Of course this is beyond the scope of this course, so there is no guarantee that anyone listening here has done this type of investigation before. It sounds like a totally interesting and worthwhile investigation and it seems likely that a lot would be learned in the process.
They talked in words mostly.
I will go through the code I found on GitHub. I agree, I think it’s worthwhile and it’s related to my research.