A doubt in visualizing deep network

In your example, I think you are right that both the 30 and the -30 should be considered “highly activated” for that particular ConvNet layer. That said, this is a contrived example: in practice the ReLU function is usually used for activation, so the learned filters end up producing large positive values rather than large negative ones.
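For instance (a minimal NumPy sketch of my own, not from your example), ReLU just zeroes out the negative responses, so only the large positive filter outputs survive:

```python
import numpy as np

# Hypothetical filter responses, including the -30 from the example
responses = np.array([30.0, -30.0, 10.0, 0.0])

# ReLU keeps positive activations and zeroes out negative ones
relu = np.maximum(0.0, responses)
print(relu)  # [30.  0. 10.  0.]
```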

However, I don’t think it’s correct to sum up all the values. It’s more correct to pick the max value (or the top N values) in that ConvNet output, and then figure out which input values (the “patch”) contributed to that max. In your case many positions tie at the max value of 30, but in practice the max is much less likely to be duplicated.

Specifically, in the given example with 30 as the max value, the following two input patches would produce a “highly activated” output (I sketch a quick check of this right after the patches):

[ [ 10 10  0 ]
  [ 10 10  0 ]
  [ 10 10  0 ] ] 

[ [ 10  0  0 ]
  [ 10  0  0 ]
  [ 10  0  0 ] ] 
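If I’m remembering the filter from your example correctly (a 3x3 vertical-edge kernel like `[[1, 0, -1], ...]` — treat that as my assumption), you can verify that both patches produce the same activation of 30:

```python
import numpy as np

# Assumed filter from the original example: a 3x3 vertical-edge kernel
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

patch_a = np.array([[10, 10, 0],
                    [10, 10, 0],
                    [10, 10, 0]])

patch_b = np.array([[10, 0, 0],
                    [10, 0, 0],
                    [10, 0, 0]])

# One convolution step is just an element-wise product followed by a sum
print(np.sum(patch_a * kernel))  # 30
print(np.sum(patch_b * kernel))  # 30
```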

If you have multiple ConvNet layers, you can keep applying this operation backwards to figure out which patch in the original image produced the highly activated values (30 in your example).
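Here is a rough NumPy sketch of that backward tracing for a single layer (the `most_activating_patch` helper is hypothetical, not code from any paper): take the argmax of the feature map and slice out the corresponding receptive field in the input. For multiple layers you repeat this, so the traced region grows as you move back toward the original image.

```python
import numpy as np

def feature_map(image, kernel):
    """Valid convolution (really cross-correlation, as in most ConvNets)."""
    h, w = image.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def most_activating_patch(image, kernel):
    """Return the input patch whose receptive field produced the max activation."""
    fmap = feature_map(image, kernel)
    i, j = np.unravel_index(np.argmax(fmap), fmap.shape)
    k = kernel.shape[0]
    return image[i:i + k, j:j + k], fmap[i, j]
```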

I found the paper that talks about how this is done in practice. There’s also an online book that I think does a pretty good job explaining this.

I gave the paper a quick read, and the basic idea seems to be to use a “DeConvNet” that reverses the operations of a ConvNet (do each operation, but backwards), while also keeping track of which inputs/outputs produced the highest activation values.
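For the max-pooling part specifically, “keeping track of the mapping” means remembering which position won each pooling window (the paper calls these “switches”), so the DeConvNet can put reconstructed values back where they came from. Here is a rough 2x2-pooling sketch of that bookkeeping (my own simplification, not the authors’ code):

```python
import numpy as np

def max_pool_with_switches(x):
    """2x2 max pooling that also records which position won each window."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    switches = {}
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = x[i:i + 2, j:j + 2]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[i // 2, j // 2] = window[r, c]
            switches[(i // 2, j // 2)] = (i + r, j + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """'Reverse' the pooling: place each pooled value back at its recorded position."""
    out = np.zeros(shape)
    for (pi, pj), (i, j) in switches.items():
        out[i, j] = pooled[pi, pj]
    return out
```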
