VGG vs ResNet: Gradient ascent for generating an image that maximizes the class score with TensorFlow and Keras


Environment: Python 3, tf.keras 2.15.0

I’m working on an XAI project and want to experiment with ResNet for generating images that maximize a class score or activation. This is usually done via gradient ascent on the input image. However, I’m struggling to get high-resolution, human-interpretable images out of ResNet50V2, whereas it is relatively easy with VGG19. I want to know 1) whether I’m doing something wrong in how I’m handling the gradient ascent with ResNet, 2) whether there is a fundamental reason ResNet is not expected to give good results, and 3) what I could change to get more realistic color in each channel.
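
For concreteness, here is a minimal sketch of the kind of gradient-ascent loop I mean (illustrative, not my exact notebook code; the target class index, initialization range, and step size are placeholders):

```python
import tensorflow as tf

# Load the classifier with raw logits exposed (classifier_activation=None),
# so we ascend the unnormalized class score rather than the softmax output.
model = tf.keras.applications.VGG19(weights="imagenet", classifier_activation=None)

target_class = 130  # placeholder ImageNet class index
img = tf.Variable(tf.random.uniform((1, 224, 224, 3), minval=-0.125, maxval=0.125))

for step in range(200):
    with tf.GradientTape() as tape:
        score = model(img, training=False)[0, target_class]  # logit to maximize
    grads = tape.gradient(score, img)
    grads /= tf.math.reduce_std(grads) + 1e-8  # normalize for a stable step size
    img.assign_add(0.01 * grads)               # gradient *ascent* on the input
```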

Here are some resources I’ve used to build my code:

  1. Visualizing what convnets learn
  2. utkuozbulak/pytorch-cnn-visualizations: PyTorch implementation of convolutional neural network visualization techniques (GitHub)
  3. Aman's AI Journal • CS231n • Visualizing and Understanding
  4. https://cs231n.stanford.edu/slides/2021/lecture_14.pdf
  5. amanchadha/stanford-cs231n-assignments-2020: assignment3/NetworkVisualization-TensorFlow.ipynb (GitHub)

If you look through the references above, you’ll notice that none of them use a ResNet to generate the input image for a target category. Is this because it’s generally known that ResNets make poor image generators under gradient ascent?

Following the above resources, I’ve written code in tf.keras (link to colab notebook) to create images with pretrained VGG19 and ResNet50V2. Through exhaustive experimentation I’ve found hyperparameters that work, and I’ve also found that the images produced by VGG19 are far more distinct and human-interpretable than those produced by ResNet50V2.
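
One detail the CS231n material stresses, which I’ve tried to follow: ascend the pre-softmax class score rather than the softmax probability (the softmax can be increased just by suppressing other classes), and regularize the image, e.g. with an L2 penalty, jitter, or occasional blurring. A sketch of that kind of regularized step (function name and hyperparameters are illustrative):

```python
import tensorflow as tf

def ascent_step(model, img, target_class, lr=0.05, l2_reg=1e-4):
    """One regularized gradient-ascent step (CS231n-style sketch)."""
    with tf.GradientTape() as tape:
        tape.watch(img)  # img is a plain tensor here, so watch it explicitly
        score = model(img, training=False)[0, target_class]    # pre-softmax logit
        loss = score - l2_reg * tf.reduce_sum(tf.square(img))  # L2 image penalty
    grads = tape.gradient(loss, img)
    grads /= tf.norm(grads) + 1e-8
    return img + lr * grads
```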

What I’d like to know is:

  1. Is there an implementation out there that shows how to get better gradient-ascent results with a ResNet?

  2. Is there a fundamental reason that non-residual networks are better suited to producing human-interpretable images than a ResNet, or am I doing something wrong in my gradient-ascent process?

  3. If possible, what can I do to improve my process with ResNet?

  4. Is there a better way to generate real-world coloration during the gradient-ascent process for both models (i.e., better handling of the channels)? See the sketch after this list for what I mean.
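
On question 4 specifically: I suspect part of my coloration problem is that Keras’s preprocess_input conventions differ between the two models. VGG19 uses “caffe” mode (BGR channel order with per-channel ImageNet means subtracted), while ResNet50V2 uses “tf” mode (pixels scaled to [-1, 1]), so the reverse mapping back to display RGB has to differ too. A sketch of what I mean by per-model deprocessing (the constants are Keras’s documented conventions, not values from my notebook):

```python
import numpy as np

def deprocess_resnet50v2(x):
    """Undo ResNet50V2's 'tf'-mode preprocessing: pixels were scaled to [-1, 1]."""
    x = (np.squeeze(np.array(x)) + 1.0) * 127.5
    return np.clip(x, 0, 255).astype("uint8")

def deprocess_vgg19(x):
    """Undo VGG19's 'caffe'-mode preprocessing: BGR with ImageNet means subtracted."""
    x = np.squeeze(np.array(x)).astype("float64")
    x += np.array([103.939, 116.779, 123.68])  # add the per-channel BGR means back
    x = x[..., ::-1]                           # reorder BGR -> RGB for display
    return np.clip(x, 0, 255).astype("uint8")
```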

Link to my colab notebook