What's the difference between the style cost function and the content cost function?

Hi friend and mentor,

As in the picture below, my understanding is that both cost functions are almost the same: they just take the Frobenius norm (i.e. find the distance) between two objects (S and G, or C and G).

  1. The biggest difference is that the content cost only cares about ONE picked layer (somewhere in the middle), but the style cost function runs over multiple layers. Is this correct?

  2. The content one doesn't compute correlations; it's just a regular norm, not Frobenius. But the style function needs correlations, and the Frobenius norm?

The equations for the style function are clear to me, but the content one is not.

thank you!



hmm… could anyone help?

This was all explained both in the lectures and in the assignment notebook. I’ll give my summary of what I have from the notes that I took while watching the lectures. Then what I would suggest is that you go back and watch the lectures on this again with the following thoughts in mind:

In all this, we depend on a pretrained image classification network. For the purposes here, they use VGG-19. Then we start with two input images: the content image and the style image. The goal is to create a “stylized” version of the content image by applying the artistic “style” from the style image to the content image. For this purpose, you want to “subtly” modify the content. So how they express that is that they use a basic “distance” metric between our “generated” image and the original input content image. The way they do this is to pick one of the inner hidden layers of VGG-19 and extract the activations from that hidden layer for both the content image and our current generated image and then basically just compute the square of the 2-norm of the difference between those two. In other words, we want the generated image to be “close to” the content image in a pretty objective “Euclidean distance” way.
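The content cost described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the assignment's actual code: `a_C` and `a_G` stand for the chosen hidden layer's activations for the content and generated images, and the `1/(4*n_H*n_W*n_C)` scaling is the normalization the course assignment uses (the core idea is just the squared 2-norm of the difference).

```python
import numpy as np

def content_cost(a_C, a_G):
    """Squared 2-norm of the difference between the hidden-layer
    activations of the content image (a_C) and the generated image
    (a_G), both of shape (n_H, n_W, n_C). The scaling constant
    follows the normalization used in the course assignment."""
    n_H, n_W, n_C = a_C.shape
    return np.sum((a_C - a_G) ** 2) / (4 * n_H * n_W * n_C)
```

If the generated image's activations exactly match the content image's, this cost is zero; gradient descent on the generated image pushes it toward that "Euclidean closeness" to the content.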

But then for the style, the goal is to do something more subtle. For that, rather than using a direct distance metric as they do in the content case, they decide that one way to express the meaning of “style” is to take the activations at a hidden layer of VGG-19 processing the “style” image and then compute the Gram Matrix of those activations, which is a version of a correlation matrix, but they do it in such a way that it computes the correlations for the channel dimensions of the image, basically “averaging out” the geometric dimensions. Of course the point here is that we’re looking at hidden layers in the VGG-19 model, so the “channels” are no longer just simple colors (RGB) as in the input images. Also note that the typical pattern in a convnet is that the number of channels increases as you go through the hidden layers and the geometric dimensions tend to reduce. So the channels being used represent derived “features” of the sort that Prof Ng was discussing in the lecture “What Are Deep ConvNets Learning” in this Neural Style Transfer section of C4 Week 4.
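The "correlations across channels, averaging out the geometric dimensions" idea can be sketched concretely. This is an illustrative NumPy version (the assignment does the equivalent in TensorFlow): flatten the height and width dimensions so each channel becomes one long vector, then take all pairwise dot products between channels.

```python
import numpy as np

def gram_matrix(a):
    """Channel-by-channel "correlation" (Gram) matrix of one layer's
    activations, shape (n_H, n_W, n_C). The geometric dimensions are
    flattened away, so entry (i, j) is the dot product of channel i
    with channel j over all spatial positions."""
    n_H, n_W, n_C = a.shape
    A = a.reshape(n_H * n_W, n_C).T   # shape (n_C, n_H * n_W)
    return A @ A.T                    # shape (n_C, n_C)
```

Note that the result is (n_C, n_C) regardless of the layer's spatial size, which is what lets "style" be compared without regard to where in the image a texture appears.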

They must have done some experimentation in this process and realized that they could get better results by making the style cost be the sum of the above type of “channel correlation” cost across multiple different hidden layers.
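Summing that per-layer cost across several hidden layers can be sketched as follows. The per-layer normalization matches the course assignment; the layer choices and weights are left as inputs because they are hyperparameters (the paper and assignment use their own particular selections).

```python
import numpy as np

def layer_style_cost(a_S, a_G):
    """Squared Frobenius distance between the Gram matrices of the
    style image's and generated image's activations at one layer,
    with the normalization used in the course assignment."""
    n_H, n_W, n_C = a_S.shape
    def gram(a):
        A = a.reshape(n_H * n_W, n_C).T
        return A @ A.T
    return np.sum((gram(a_S) - gram(a_G)) ** 2) / (4 * n_C**2 * (n_H * n_W)**2)

def total_style_cost(style_acts, gen_acts, weights):
    """Weighted sum of the per-layer style costs over the chosen
    layers (activations given as parallel lists, one entry per layer)."""
    return sum(w * layer_style_cost(a_S, a_G)
               for a_S, a_G, w in zip(style_acts, gen_acts, weights))
```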

That is what the style cost is expressing: trying to encourage those correlations expressed in the style image onto the generated image rather than directly computing the Euclidean distance as in the content case.

Ok, with all those ideas expressed, can I explain intuitively why it all works? No, sorry, I don’t think I can really say anything meaningful there.

I really think that what you should do is go back and listen to what Prof Ng actually says with the above ideas in mind. It’s not supposed to be my job to watch the lectures for you. You need to watch them for yourself. There is no shame in watching them more than once if not everything Prof Ng says sinks in the first time through. This is the culmination of Course 4 and nobody is saying that any of this is simple or obvious material. It takes work to understand it all.


Wow, Paul. First of all, thanks so much for those details. What a kind man :)

I think I made a mistake in the title or in describing my question. All I wanted was to double-check my understanding of the equations. So, for the content cost equation, I just need to pick ONE layer instead of running over all the layers like the style cost function does, right? I think the answer is yes; you mentioned it as well.

This is the part that confused me a little. I think I understand the style function, which makes sense because I want G close to S (through all the channels); in other words, the style in G should be close enough to that of S.

But the content cost function just picks one layer in the middle and measures the distance between C and G. I am not quite following the "physical meaning" of it. Maybe I should do the coding homework first, then come back.

Thanks for your time, Paul. I will make my question clearer next time. Have a good weekend.

I tried to describe the intuition for this part in my previous reply. Think of what the resulting image looks like: it’s recognizable as the same content, right? But somehow it is “smeared around” with artistic brushstrokes and the like. So you want it to be “close” in a pretty straightforward sense to the original content image. In our case, it starts as a photograph and ends up looking like a painting, but it’s clear what the picture is depicting. It’s not just a Matisse painting of some completely different subject, right?

But then how you express the “style” is more complex. That is where they use the channel correlation mechanism.

Yes, they just choose one internal hidden layer output for the content cost, but use a selection of the hidden layers for the style cost.

If you have not yet done the assignment, I think that is the best idea: just work through it and then you’ll get to see concretely what the code does. That is where the “rubber meets the road” and you translate the math formulas into actual code.


Will do. Thanks for your time again.