Understanding week 4 programming exercise: Neural Style Transfer, section 5

I’m having some trouble following along with the code in the second programming exercise for week 4, in particular section 5.5.

I’m unsure as to why in section 5.4 we do
content_target = vgg_model_outputs(content_image)

and then appear to redo the same thing again in section 5.5.1 with

preprocessed_content = tf.Variable(tf.image.convert_image_dtype(content_image, tf.float32))
a_C = vgg_model_outputs(preprocessed_content)

but this time on a version of the content_image converted to a tf.Variable. What is the difference between these two passes of an image to vgg_model_outputs and what are they for?

Also, in section 5.5.1, it says we should “Set a_C to be the tensor giving the hidden layer activation for layer ‘block5_conv4’ using the content image.”
However, I can’t work out from the code how a_C = vgg_model_outputs(preprocessed_content)
specifically selects the block5_conv4 layer’s activation. It just looks like we’re passing the preprocessed_content image to the vgg_model_outputs model and getting the activations from all layers, rather than specifically block5_conv4. How does this code select block5_conv4?

For (hopefully) some clarity on my current (lack of) understanding, this diagram shows how I think the pieces fit together:

Hello @James_Bur,

Let’s do an exercise!

What I am going to do is suggest some inspection steps, in the hope that you will be able to look at that code from a different angle and arrive at some answers yourself.

It is not necessary, but if you prefer to verify any or all of your answers with us, please feel free to share. :slight_smile:

  1. Please print out content_target and preprocessed_content and identify their difference. You may also run print(vgg.summary()) to verify that the model does not come with any normalization layer. (Edit: I meant to compare content_image and preprocessed_content, not content_target)

  2. Please verify which of content_target and a_C is actually reused, and which is never reused. If something is never reused, it may not have any significance.

  3. Please check out the code cells in section 5.4, particularly the line content_layer = [('block5_conv4', 1)], which is highly relevant to the goal in question. Examine that line’s effect on the get_layer_outputs() function, which actually produces the vgg_model_outputs that claims to achieve the goal in question.
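For orientation, get_layer_outputs in the notebook looks roughly like the following (a sketch from memory, not the exact notebook code — the point is how the (name, weight) tuples select which layers’ activations become model outputs):

```python
import tensorflow as tf

def get_layer_outputs(vgg, layer_names):
    """Build a model whose outputs are the activations of the named layers.

    layer_names is a list of (layer_name, weight) tuples,
    e.g. STYLE_LAYERS + [('block5_conv4', 1)].
    """
    outputs = [vgg.get_layer(name).output for name, _ in layer_names]
    return tf.keras.Model([vgg.input], outputs)
```

Because content_layer is concatenated last in STYLE_LAYERS + content_layer, the content activation ends up as the last element of the returned list.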

Everything I have suggested here I already did myself while investigating your questions. Please try them out.


I had a look at them:
preprocessed_content is the input content image converted to a tf.Variable and is of shape (1, 400, 400, 3).
content_target is a tf.Tensor of shape (1, 400, 400, 64) comprising the activations of the STYLE_LAYERS + content_layer selected set of layers of the vgg model when fed preprocessed_content as its input.

It appears that content_target is not reused anywhere later on, whereas a_C is used in the later gradient optimisation. This leads me to ask: why do we compute content_target if it’s never used anywhere?

Also, comparing content_target and a_C (before having run any of the later gradient optimisation etc), their values appear to differ:

I’m not sure why they differ if they’re both the activations of the same set of model layers when fed the input content image. The only difference I can think of is that a_C is generated using vgg_model_outputs(preprocessed_content), whereas content_target is generated using vgg_model_outputs(content_image), so the difference must lie in the fact that preprocessed_content has had its datatype converted to 32-bit float using tf.image.convert_image_dtype, whereas content_image has not been converted. So the vgg model responds differently to the two inputs. Is this correct?

I understand that content_layer = [('block5_conv4', 1)] adds that layer to the set of layers passed to get_layer_outputs to generate the set of activations vgg_model_outputs. What I was confused about is how in section 5.5.1 it says:

Set a_C to be the tensor giving the hidden layer activation for layer “block5_conv4” using the content image

but then in the code we have a_C = vgg_model_outputs(preprocessed_content). In the previous section (5.4), vgg_model_outputs is generated not just from “block5_conv4” but also from the style layers, i.e. vgg_model_outputs = get_layer_outputs(vgg, STYLE_LAYERS + content_layer). So it appears a_C is not just “the tensor giving the hidden layer activation for layer ‘block5_conv4’”, as the rubric claims, but also contains the hidden layer activations for the STYLE_LAYERS as well.

Why do we not need two separate vgg_model_outputs, one with the activations for content_layer and another for the activations for STYLE_LAYERS, in order to compute a_C and a_G?
So something like:

vgg_model_outputs_C = get_layer_outputs(vgg, content_layer)
vgg_model_outputs_S = get_layer_outputs(vgg, STYLE_LAYERS)

a_C = vgg_model_outputs_C(preprocessed_content)
a_S = vgg_model_outputs_S(preprocessed_style)

Look more carefully at how the outputs are actually used by the two cost functions. Notice that they select the last index when dealing with content and strip off the last index when dealing with style.
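In other words, the single output list is split by position. A minimal stand-in sketch (plain Python strings in place of real activation tensors, just to show the indexing):

```python
# Stand-in for the list returned by vgg_model_outputs: five style-layer
# activations followed by the one content-layer ('block5_conv4') activation.
outputs = ['style1', 'style2', 'style3', 'style4', 'style5', 'content']

a_S = outputs[:-1]  # every element except the last -> the style activations
a_C = outputs[-1]   # the last element only -> the content activation
```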


What about their values? That would be a very visible difference. I mean the values in preprocessed_content and content_target.

This was an ah-ha moment for me :slight_smile: I missed the difference between the slicing
a_S = style_image_output[:-1]
a_C = content_output[-1]
in the style and content cost calculations. A very important colon! Thank you!

The preprocessed_content values are within the 0-1 range. This is consistent with it being the input content_image after converting from a uint8 tf.Tensor into a float32 tf.Tensor using tf.image.convert_image_dtype.
My previous statement about content_target being (1, 400, 400, 64) was actually not correct; it is in fact a set of tf.Tensors with the following shapes and value ranges:

(1, 400, 400, 64), range: [0.0, 771.5628]
(1, 200, 200, 128), range: [0.0, 3228.934]
(1, 100, 100, 256), range: [0.0, 6436.997]
(1, 50, 50, 512), range: [0.0, 15041.766]
(1, 25, 25, 512), range: [0.0, 2772.1821]
(1, 25, 25, 512), range: [0.0, 188.24327]

This is consistent with content_target being the set of activations of the six layers in STYLE_LAYERS + content_layer of the vgg network when it is fed content_image as its input.

However, I’m not sure if I understand how comparing preprocessed_content and content_target relates to my question, which was about the difference between content_target and a_C, or indeed why we bother computing content_target if we don’t appear to use it afterwards but instead use a_C.

To reiterate and hopefully clarify my question, we do:

content_target = vgg_model_outputs(content_image)

and then later:

preprocessed_content = tf.Variable(tf.image.convert_image_dtype(content_image, tf.float32))
a_C = vgg_model_outputs(preprocessed_content)

but, as I think I understand it, we never re-use content_target and instead use a_C in our optimisation, in which case why do we compute content_target?

I can now at least partly answer my original question:

However, this still leaves the question of why we bother computing content_target if we don’t use it anywhere.

I think this is just an oversight on their part. Or maybe they just wanted to show an example of how to invoke the function that they just created. You’re right that it doesn’t serve any computational purpose as the output is never used.


Hey @James_Bur,

Oh, I am sorry that I have mixed up the variables. I meant to compare between content_image and preprocessed_content.

content_image has a range of 0 to 255, whereas preprocessed_content has a range of 0 to 1. If you have checked print(vgg.summary()), you will find that the model does not have a normalization layer to convert the range into 0 to 1. Therefore, we create preprocessed_content and use a_C instead of content_target. And as you and Paul have commented, content_target is actually not necessary.
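To see the rescaling concretely: tf.image.convert_image_dtype divides uint8 values by 255 when converting to a float dtype, which is exactly why preprocessed_content lands in the 0–1 range.

```python
import tensorflow as tf

# uint8 pixels in [0, 255] ...
pixels = tf.constant([[0, 128, 255]], dtype=tf.uint8)

# ... become float32 values in [0, 1] (each value divided by 255)
scaled = tf.image.convert_image_dtype(pixels, tf.float32)
```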

I am sorry for mistyping the variable names, and if I have wasted your time.

Sorry again.