Neural Style Transfer - Why do we pick a hidden layer from our content and style images?

I know this was explained in the lectures; it’s because the middle layers have learned enough detail. However, I wanted to clarify the intuition behind not using the images themselves.

Why don’t we want the generated image to be compared to the actual content and style images? Is it because the generated image is only supposed to capture the “essence” of both images? And would using the actual images make training impossible, since we cannot generate an image that literally combines both of them?

There are several layers (pun intended) to the answer here:

At a high level, everything here is experimental. I’m sure the authors of the paper ran many experiments with how to implement this idea, and what they ended up publishing is the approach that worked best. Of course this is art, so “best” is by definition subjective.

Also note that in the lectures and in the notebook, this is explained, and it’s also pointed out that you can run your own experiments by varying which hidden layers you sample and seeing how that affects the results.

But to recap the explanation given in the notebook: the internal layers of a network “learn” to recognize different things at different levels of detail in the image. The earlier layers learn more primitive features such as lines, curves, and color patterns; the deeper you go into the network, the more complex the recognized features become (the shape of a cat’s ear or eye or tail). So you can sample from different layers to get a mix of primitive and more abstract features. For more intuition on that sort of thing, we just had the very cool lecture from Prof Ng earlier in Week 4 titled “What Are Deep ConvNets Learning?” It might be worth watching that again if you are still feeling unsure about what information you get by sampling the internal layers.
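For concreteness, here is a minimal sketch (not the assignment code) of how you could read out the activations of one hidden layer from a pretrained VGG19 in TensorFlow. The layer name `block4_conv2` is just an example choice, not necessarily the one the notebook uses:

```python
import tensorflow as tf

# Load a pretrained VGG19 without the classification head and freeze it.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
vgg.trainable = False

# Build a small model that maps an input image to the activations of one
# chosen hidden layer. "block4_conv2" is an example; try other layers too.
hidden_output = vgg.get_layer("block4_conv2").output
feature_extractor = tf.keras.Model(inputs=vgg.input, outputs=hidden_output)

def get_hidden_activations(image):
    # image: a (1, H, W, 3) float tensor already preprocessed for VGG19.
    # The returned tensor of shape (1, n_H, n_W, n_C) is what the content
    # and style costs compare, rather than the raw pixels.
    return feature_extractor(image)
```

Sampling a shallower layer (e.g. `block1_conv1`) emphasizes the primitive features, while deeper layers emphasize the more abstract ones, which is exactly the knob the notebook invites you to experiment with.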

Also think about what you mean when you suggest just using the final images. How would you implement that? What is it that you extract from an actual final RGB image of a painting? It’s just raw data. You could simply copy it, but that wouldn’t be very interesting, would it? The whole point is that you somehow want to extract a “style” from one image and then adapt the given image using that style. How would you do that if you only had a raw RGB image? Maybe you could implement some sort of interpolation between the two images. You can try that and see what kind of results it gives.
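To make that last point concrete, here is a minimal sketch of the naive pixel-space alternative: simply blending the two raw RGB images. The function name and the `alpha` parameter are just illustrative. There is no notion of “style” anywhere in it, which is exactly why NST works on hidden-layer activations instead:

```python
import numpy as np

def blend_images(content_img, style_img, alpha=0.5):
    # content_img, style_img: float arrays of the same shape (H, W, 3) in [0, 1].
    # This is just a crossfade of raw pixels; nothing is "learned" or extracted.
    return alpha * content_img + (1.0 - alpha) * style_img
```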

And finally, if all the above doesn’t feel complete or convincing, you could rewatch the relevant lectures and pay careful attention to what Prof Ng says, with all the above in mind. Maybe with that as background, the “picture” (pun intended again) will come together the second time through. 🎨 🤓


Actually, one other thought: there are a number of other techniques for generating synthetic images. If you find this subject interesting, I’d recommend considering the GANs specialization after DLS. It goes deeply into using Generative Adversarial Networks to generate synthetic images, and the techniques used there work directly from images in a way quite different from what we’re doing here with Neural Style Transfer. GANs are really interesting and definitely worth a look!


Thank you so much for the reply! As always, your responses are helpful and informative.

So, your response seems to confirm my intuition. The middle layers capture an “essence” of the images, without turning the generated image into an exact replica of the input images.

Speaking of the raw data, from what I understand of your response, the raw RGB image also can’t be used because it hasn’t been run through the model. So the raw image has not been encoded into the form that the functions we implemented operate on anyway. Correct me if my understanding is wrong.

Yes, I think that’s a good way to describe what is happening here.
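For concreteness, here is a minimal sketch of the kind of cost that makes the “encoding” step necessary: the content cost compares hidden-layer activations `a_C` and `a_G` obtained from forward passes, not raw RGB pixels. The normalization constant follows the usual NST convention, so treat this as an assumption rather than the exact assignment code:

```python
import tensorflow as tf

def compute_content_cost(a_C, a_G):
    # a_C, a_G: activations of the same hidden layer, shape (1, n_H, n_W, n_C),
    # produced by running the content image and the generated image through
    # the network. Raw RGB images never appear in this cost directly.
    _, n_H, n_W, n_C = a_G.shape
    return tf.reduce_sum(tf.square(a_C - a_G)) / (4.0 * n_H * n_W * n_C)
```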
