Neural style transfer 'trains pixels'


Quoting from the C4W4 quiz - ‘No, Neural style transfer is about training the pixels of an image to make it look artistic, it is not learning any parameters.’

What does it mean to say that the neural network ‘trains pixels’ and that there is no learning? As far as I can see, there is a cost function that we’re trying to minimize with backprop. Wouldn’t that involve tuning the weights and biases of the convolution layers, so that the network learns parameters that tweak an input image appropriately?

I should probably read the paper, but I thought I’d ask here as well, for a gentler introduction to the concepts :)


Hey @Nidhi_Sachdev,
You can answer this yourself by thinking about one simple question. In the assignment, we define the train_step function for training, and the only model in the assignment is the pre-trained VGG-19 model. If you look carefully inside train_step, we aren’t fine-tuning any layers of VGG-19 (the only model that exists). So: “What is getting trained if no model weights are being updated?” The answer is the pixels of the generated image.

Another way to see this is to look at the process of generating 2 different images. Had any model training been involved, the weights would have changed during the first run, and we would have to reload the pre-trained model before generating the second image. In practice you can generate 2 images without having to load the model again. You just need to initialize another generated image, and voila.

Lastly, the code inside the train_step function makes this explicit.

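    # Gradient of the total cost J with respect to the image pixels, not the VGG-19 weights
    grad = tape.gradient(J, generated_image)
    # The optimizer updates only the variable it is handed here: generated_image
    optimizer.apply_gradients([(grad, generated_image)])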

As can be seen above, the gradients are calculated with respect to generated_image (this is the backprop step), and when the optimizer applies them, it updates generated_image, which holds the pixels of the generated image, and not any layers of the pre-trained VGG-19 model. I hope this helps.
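If it helps, here is a tiny pure-Python sketch of the same idea, not the assignment's code. It assumes a made-up one-layer "network" (FROZEN_W and all the helper names are hypothetical) and a plain squared-error cost. The weights are never touched; gradient descent only changes the input "pixels" until their activations match a target.

```python
# Toy illustration: the "network" weights are frozen; only the input "pixels" change.
# FROZEN_W stands in for pre-trained weights (hypothetical values, not VGG-19).
FROZEN_W = [[0.5, -1.0, 2.0],
            [1.5, 0.25, -0.5]]


def forward(w, x):
    """Activations of a single linear layer: a = W x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def cost(w, x, target):
    """Squared difference between the activations of x and the target activations."""
    return sum((a - t) ** 2 for a, t in zip(forward(w, x), target))


def grad_wrt_input(w, x, target):
    """dJ/dx = 2 * W^T (W x - target). No gradient for W is ever computed."""
    diff = [a - t for a, t in zip(forward(w, x), target)]
    return [2 * sum(w[i][j] * diff[i] for i in range(len(w)))
            for j in range(len(x))]


def train_pixels(w, x, target, lr=0.05, steps=200):
    """Plain gradient descent on the input; w is only read, never written."""
    x = list(x)
    for _ in range(steps):
        g = grad_wrt_input(w, x, target)
        x = [x_j - lr * g_j for x_j, g_j in zip(x, g)]
    return x


content_pixels = [1.0, 2.0, 3.0]
target_acts = forward(FROZEN_W, content_pixels)   # activations we want to match
weights_before = [row[:] for row in FROZEN_W]     # snapshot for comparison

generated = [0.0, 0.0, 0.0]
initial_cost = cost(FROZEN_W, generated, target_acts)
generated = train_pixels(FROZEN_W, generated, target_acts)
final_cost = cost(FROZEN_W, generated, target_acts)

print(initial_cost, final_cost)
print(FROZEN_W == weights_before)
```

The cost drops to essentially zero while FROZEN_W stays exactly as it was, which is the sense in which the pixels are "trained" and no parameters are learned. In this toy setup the final pixels need not equal content_pixels; only their activations match.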


Thanks for the explanation… this is a strange concept to wrap my head around. But yes, I do see now what is meant by the statement from the quiz. I haven’t gotten to the programming assignment yet; I’m working my way up to it. It should help clarify things more…

After working through the assignment, I think I might have a better handle on the concepts.

The first hint for what is happening here is that a pre-trained neural net is reused, in the spirit of transfer learning, to create a generated image combining a content image and a style image. (The ‘transfer’ in ‘neural style transfer’ itself refers to transferring the style of one image onto another.)

A randomly initialized generated image is passed through the pre-trained neural net, and the activations it produces are compared with those produced by the content and style images. So the new idea here (for me anyway) is that we’re comparing internal layer activations rather than the final output of the network. We then use backprop (thus ‘training’) to update the generated image (the ‘pixels’) so as to minimize the difference between its activations and those produced by the style and content images.
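To make "comparing activations" concrete for myself, here is a rough pure-Python sketch. It is an assumption-laden toy, not the assignment's implementation: activations are written as a list of channels, each a flat list of values, and normalization constants are left out. The content term compares activations position by position. For the style term I'm assuming the usual Gram-matrix formulation, which compares channel-to-channel correlations instead; that detail isn't spelled out in this thread. All function names are my own.

```python
def content_cost(a_content, a_generated):
    """Sum of squared differences between two activation maps, position by position."""
    return sum((c - g) ** 2
               for ch_c, ch_g in zip(a_content, a_generated)
               for c, g in zip(ch_c, ch_g))


def gram_matrix(a):
    """Channel correlations: G[i][j] = sum over positions of a[i][p] * a[j][p]."""
    return [[sum(p * q for p, q in zip(ch_i, ch_j)) for ch_j in a] for ch_i in a]


def style_cost(a_style, a_generated):
    """Sum of squared differences between the two Gram matrices (unnormalized)."""
    g_s, g_g = gram_matrix(a_style), gram_matrix(a_generated)
    return sum((s - g) ** 2
               for row_s, row_g in zip(g_s, g_g)
               for s, g in zip(row_s, row_g))


# Two channels, three spatial positions each (made-up numbers).
a_style = [[1.0, 0.0, 2.0],
           [0.0, 1.0, 1.0]]
# The same values with the spatial positions shuffled:
# the layout differs, but the channel statistics do not.
a_shuffled = [[2.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]]

print(content_cost(a_style, a_shuffled))  # position-wise comparison sees a difference
print(style_cost(a_style, a_shuffled))    # Gram comparison does not
```

Shuffling the spatial positions changes the content term but leaves the Gram matrix untouched, which matches the intuition that content cares about where things are and style cares about which features occur together.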

Posting here in case others find it helpful. I’m sure this was mentioned in the lecture and I just didn’t catch on…