“If you pick one hidden unit and find the nine input images that maximize that unit’s activation, you might find nine image patches like this”, quoted from the lecture.
I don’t think they described how this was implemented in the lecture. Are you just curious as to how this is done?
There might be a formula, but I think you can implement this pretty easily in code. You can simply loop over all the samples you are given, forward-propagate each one until you reach the hidden unit, and keep track of the sample indices with the largest activations for that unit.
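Here is a minimal sketch of that loop in Python (my own sketch, not code from the course). `dataset` is an iterable of input images, and `forward_to_unit` is a hypothetical helper that runs the network up to the layer of interest and returns the chosen unit’s scalar activation:

```python
# Minimal sketch: find the 9 images that most activate one hidden unit.
# `forward_to_unit` is a hypothetical helper, assumed to forward-propagate
# one sample up to the chosen hidden unit and return its scalar activation.
import heapq

def top9_images(dataset, forward_to_unit):
    top = []  # min-heap of (activation, sample_index)
    for i, x in enumerate(dataset):
        a = forward_to_unit(x)      # forward propagate to the hidden unit
        heapq.heappush(top, (a, i))
        if len(top) > 9:
            heapq.heappop(top)      # drop the smallest of the current 10
    return sorted(top, reverse=True)  # largest activation first
```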
Reason about your answers not just with the numbers, but also with what those filters do: you know the meaning of the filters, and you know what the inputs visually look like. We can’t ignore the meaning.
After that, can you explain why there is a negative sign? Please take the visual appearance of the input and the meaning of the filter into account in your explanation. What is the difference between the two inputs that could account for the negative sign even though they are processed by the same filter? How can you change the filter so that a becomes -240 and b becomes +240 (switching the signs)?
Previously, you measured the activation of a unit with respect to an input by the above formula.
My last question is: could you propose another measurement that correctly indicates, out of a, b, c, and d, which two should be maximally activated and which two minimally?
To visualize the 9 image patches that maximize the hidden unit’s activation (one image patch is extracted from each image, choosing the 9 images in the dataset that maximize the hidden unit’s activation). But I don’t know how to define when one activation counts as more “highly activated” than another.
The negative signs appear in activation b because that input image has a dark-to-bright transition: the filter is trying to find an edge with bright pixels on its left and dark pixels on its right, but instead it finds an edge with dark pixels on its left and bright pixels on its right.
The direction of the shade transition in input image a and input image b is different, which is why activation b takes a negative sign even though both are processed by the same filter.
Activation b can become positive when the filter is flipped left-to-right, as follows:
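Here is a quick NumPy check of this idea (using a toy vertical-edge filter and 6×6 inputs rather than the exact numbers above):

```python
import numpy as np
from scipy.signal import correlate2d

# Toy inputs: a is bright-to-dark, b is dark-to-bright (left to right).
a = np.array([[10, 10, 10, 0, 0, 0]] * 6)
b = a[:, ::-1]

# Vertical-edge filter that looks for bright-left / dark-right edges.
# ConvNets actually compute cross-correlation, hence correlate2d.
f = np.array([[1, 0, -1]] * 3)

print(correlate2d(a, f, mode="valid"))  # +30s at the edge it looks for
print(correlate2d(b, f, mode="valid"))  # -30s: the opposite transition

# Flipping the filter left-to-right negates its columns, so the signs swap:
f_flipped = f[:, ::-1]
print(correlate2d(a, f_flipped, mode="valid"))  # now -30s
print(correlate2d(b, f_flipped, mode="valid"))  # now +30s
```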
The yellow-highlighted part in the activation represents the edge extracted from the input image. May I ask what the grey-highlighted part in the activation represents? I thought it represented the background of the edge in the input image, but it seems not.
In your example, I think you are right that both the 30 and -30 values should be considered “highly activated” for that particular ConvNet layer. With that said, this is a contrived example. In practice, the ReLU function is usually used for activation, so the learned filters tend to output large positive values rather than negative ones.
However, I don’t think it’s correct to sum up all the values. It’s more correct to pick the max value (or the top N max values) in that ConvNet output, and then figure out which input values (or “patch”) contributed to that max value. There are many max values (30) in your case, but max values are less likely to be duplicated in practice.
Specifically, in the given example with 30 as the max value, the following 2 inputs/patches would result in a “highly activated” value:
If you have multiple ConvNet layers, you can keep applying this operation backwards to figure out which patch in the original image resulted in the highly activated values (30 in your example).
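Here is a rough sketch of that backward mapping for a single layer (my own sketch, assuming stride 1, no padding, and a plain cross-correlation layer):

```python
import numpy as np
from scipy.signal import correlate2d

def max_activation_patch(image, filt):
    """Find the input patch behind the single largest activation of a
    stride-1, no-padding conv layer (a sketch, not the course's code)."""
    fmap = correlate2d(image, filt, mode="valid")
    r, c = np.unravel_index(np.argmax(fmap), fmap.shape)
    fh, fw = filt.shape
    # With stride 1 and no padding, output position (r, c) only sees
    # input rows r..r+fh-1 and columns c..c+fw-1.
    return fmap[r, c], image[r:r + fh, c:c + fw]
```

With strides, padding, or pooling the index arithmetic changes, but the idea of walking the receptive field backwards layer by layer is the same.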
I found the paper that talks about how this is done in practice. There’s also an online book that I think does a pretty good job explaining this.
I gave the papers a quick read, and the basic idea seems to be to use a “DeConvNet” that reverses the operation of a ConvNet (just do the operation, but backwards), while also keeping track of the mapping of outputs/inputs that results in the highest activation value.
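Here’s a toy single-layer version of that reversal (loosely in the spirit of the paper, but not its exact implementation; it skips the ReLU and the unpooling “switches”): keep only the strongest activations in the feature map, then map them back to pixel space with the transposed convolution.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

def project_back(image, filt, keep=1):
    """Toy single-layer 'deconv': zero out all but the `keep` strongest
    activations (ties included), then spread them back to pixel space."""
    fmap = correlate2d(image, filt, mode="valid")
    thresh = np.sort(np.abs(fmap).ravel())[-keep]
    kept = np.where(np.abs(fmap) >= thresh, fmap, 0.0)
    # The transpose of the forward cross-correlation is a full
    # convolution with the same filter (convolve2d flips the kernel
    # internally, which is exactly the reversal we want).
    return convolve2d(kept, filt, mode="full")
```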
Since our mentor @hackyon has shared those great references, I will try to align the following discussion of one of your questions with them.
The output on the right-hand side is called a feature map, and each number in it represents the detection of that feature (a vertical edge in this case). Therefore, I think the zeros, or the grey area, simply mean that the vertical edge (which the filter is meant to detect) is not detected there.
The method in those references reverses the convolved features back into pixels. If we think about this process with the zeros you asked about, it is a good guess that those zeros, when reversed back to pixels, will not express any features; only the 30s will.
In fact, if we look just at your feature map on the right-hand side, we can tell there is a vertical edge in the center, although we should probably be careful not to make a claim like this when examining a middle layer’s feature map in a multilayer ConvNet.
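To make that guess concrete with a toy version of the same example: if we project the whole feature map straight back through the filter, the zeros contribute nothing, and only the 30s light up pixels around the edge.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

a = np.array([[10, 10, 10, 0, 0, 0]] * 6)  # bright-to-dark toy input
f = np.array([[1, 0, -1]] * 3)             # vertical-edge filter
fmap = correlate2d(a, f, mode="valid")     # 30s at the edge, 0s elsewhere
print(convolve2d(fmap, f, mode="full"))    # back to pixel space: nonzero
                                           # only around the edge columns
```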