What is the network architecture for Neural Style Transfer?

A critical piece is missing in the explanation of Neural Style Transfer: what is the network architecture? My guess is that it is a Siamese network with three input images: the generated, the content, and the style. Note that this would be a very unusual architecture, because the parameters being trained are in the input layer (i.e. the generated image itself). Am I correct?
Also, can you train the parameters of the deeper layers simultaneously with training the generated image?

Hi, interesting question. You don’t actually need a ‘model’ as such, and you’re not training on multiple content, style, and generated images. Style transfer is an algorithm that works with just two images: a content image and a style image. You could give me any two such images and I could run style transfer for you, provided I have some pretrained image classifier at hand just for the encodings. The classifier could be almost anything; all that matters is that it has learned to extract meaningful features that carry information about content and style. I then randomly initialize a generated image, compute the content loss and the style loss, and iteratively update the generated image to reduce those losses. Et voilà: after many iterations of reducing the loss, the generated image is the desired output.
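To make the "you train the image, not the network" point concrete, here is a minimal numpy sketch of that loop. It uses a toy stand-in for the pretrained classifier (a fixed random linear map instead of real VGG layers), so every name, shape, and coefficient here is illustrative, not the course's actual code:

```python
import numpy as np

# Toy "pretrained classifier": one frozen linear encoding layer.
# In real NST this would be intermediate layers of e.g. VGG-19.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))            # frozen "learned" weights

def encode(img):                          # encodings of a flattened image
    return W @ img

def gram(f):                              # style statistic (Gram matrix)
    return np.outer(f, f)

def loss(g, c_feat, s_gram, alpha=1.0, beta=1e-6):
    f = encode(g)
    return alpha * np.sum((f - c_feat) ** 2) + beta * np.sum((gram(f) - s_gram) ** 2)

content = rng.normal(size=64)             # pretend content image
style = rng.normal(size=64)               # pretend style image
generated = rng.normal(size=64)           # the only thing we "train"

c_feat, s_gram = encode(content), gram(encode(style))
before = loss(generated, c_feat, s_gram)

lr, alpha, beta = 1e-3, 1.0, 1e-6
for _ in range(200):
    f = encode(generated)
    grad_c = 2 * W.T @ (f - c_feat)               # content-loss gradient
    grad_s = 4 * W.T @ ((gram(f) - s_gram) @ f)   # style-loss gradient
    generated -= lr * (alpha * grad_c + beta * grad_s)

after = loss(generated, c_feat, s_gram)
# The pixels of `generated` are the trainable parameters; W never changes.
```

The key design point: gradients flow through the frozen encoder back to the image itself, which is exactly the unusual "parameters in the input layer" situation the question describes.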

There is no input/output kind of model. You are basically minimizing a loss computed from just two images’ encodings. I suggest you think of neural style transfer as an algorithm that is an application of CNNs, rather than as a model/architecture in itself.

Thank you for the explanation. I think that this is a critical point and it should be stated explicitly in the videos, as it is not self-evident.

I would think the results of neural style transfer might depend somewhat on the images in the training set for the pretrained image classifier. For example, the results might differ if the classifier was trained on images of faces versus text versus sign language. Does anyone have experience confirming this?

The results of training always depend on the contents of the training set.

Performance of NST does indirectly depend on the training images, but I think it is important to remember how that information is retained and used during NST, since the training set inputs are long gone by the time you’re doing transfer learning. Clearly, it is through the learned weights. And those, in turn, depend not only on the original training set but also on the original model architecture and the loss function used during training.

I think the statements above that “there is no input/output kind of model” and that “the image classifier could be anything” undervalue that importance. Without an underlying model with useful feature extraction layers, and without weights learned from a suitable loss function, NST outputs won’t be pretty, either aesthetically or from a computer science standpoint.

NST also depends heavily on how style similarity is measured on the two input images (Style and Content). The paper referenced in this exercise, and its code, use the Gram matrix, but that isn’t the only choice available (see the paper Neural Style Transfer: A Review linked below).
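Since the Gram matrix comes up here, a quick numpy sketch of how it is typically computed from one layer’s activations may help. The shapes below are illustrative and not tied to any particular network:

```python
import numpy as np

# Gram matrix of a conv layer's activations: pairwise correlations
# between feature maps, which Gatys et al. use to measure "style".
H, W_, C = 4, 4, 3                       # height, width, channels (toy sizes)
rng = np.random.default_rng(1)
acts = rng.normal(size=(H, W_, C))       # activations at one layer

F = acts.reshape(-1, C)                  # (H*W, C): one column per feature map
G = F.T @ F                              # (C, C) Gram matrix

# G[i, j] is the dot product of feature map i with feature map j, so it
# captures which features co-occur while discarding where they occur.
```

Discarding spatial position is exactly why the Gram matrix characterizes style rather than content, and also why alternatives to it exist (as surveyed in the review paper linked below).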

The ability of NST to produce interesting output does depend on the original training images, but also on the original classifier network architecture, the original loss function, the learned weights retained in the trained model, choice of style and content layer in the classifier network, and the choice made for measuring Style similarity.

Here are some related links for further contemplation:

The original VGG paper: https://arxiv.org/pdf/1409.1556.pdf
An implementation of VGG-19 in Python and Keras: deep-learning-models/vgg19.py at master · fchollet/deep-learning-models · GitHub
The Gatys et al paper: https://arxiv.org/pdf/1508.06576.pdf
A review of style transfer, both before and since the Gatys paper: https://arxiv.org/pdf/1705.04058.pdf
