Layer selection for content cost


Week 4: Neural style transfer assignment.
In the lesson and in the notebook text, it says that a middle layer should be selected to evaluate the content cost, but in the notebook code the last layer, ‘block5_conv4’, is selected.
Am I missing something?

Which week number and which assignment?


Sorry, just edited the original post.

They do talk about using middle layers here in the notebook as well. Here’s some verbiage from early on:

To choose a “middle” activation layer a[l]:

You need the “generated” image G to have similar content as the input image C. Suppose you have chosen some layer’s activations to represent the content of an image.

  • In practice, you’ll get the most visually pleasing results if you choose a layer in the middle of the network–neither too shallow nor too deep. This ensures that the network detects both higher-level and lower-level features.
  • After you have finished this exercise, feel free to come back and experiment with using different layers to see how the results vary!

The code here is pretty confusing and hard to read, but I think you’re just missing the details of what they are doing. Notice that first they define this:

    STYLE_LAYERS = [
        ('block1_conv1', 0.2),
        ('block2_conv1', 0.2),
        ('block3_conv1', 0.2),
        ('block4_conv1', 0.2),
        ('block5_conv1', 0.2)]

So that gives you five conv layers spread throughout the predefined model layers. Then later they add one more layer which is block5_conv4:

content_layer = [('block5_conv4', 1)]

vgg_model_outputs = get_layer_outputs(vgg, STYLE_LAYERS + content_layer)

So vgg_model_outputs ends up being a predefined model that, when called on an image, returns a list of 6 activation outputs from those 6 selected layers of the network.
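For reference, here is a minimal, self-contained sketch of what `get_layer_outputs` presumably does under the hood. The real notebook supplies this function for you and uses pretrained ImageNet weights; I pass `weights=None` here just so the sketch runs without downloading anything, and the exact body of the real function may differ.

```python
import tensorflow as tf

STYLE_LAYERS = [
    ('block1_conv1', 0.2),
    ('block2_conv1', 0.2),
    ('block3_conv1', 0.2),
    ('block4_conv1', 0.2),
    ('block5_conv1', 0.2)]

content_layer = [('block5_conv4', 1)]

def get_layer_outputs(vgg, layer_names):
    """Build a model that returns the activations of the selected layers."""
    outputs = [vgg.get_layer(name).output for name, _ in layer_names]
    return tf.keras.Model(inputs=vgg.input, outputs=outputs)

# weights=None avoids the ImageNet download for this sketch;
# the assignment uses the pretrained VGG19.
vgg = tf.keras.applications.VGG19(include_top=False, weights=None,
                                  input_shape=(400, 400, 3))
vgg.trainable = False

vgg_model_outputs = get_layer_outputs(vgg, STYLE_LAYERS + content_layer)
activations = vgg_model_outputs(tf.zeros((1, 400, 400, 3)))
print(len(activations))  # one activation tensor per selected layer
```

Calling the resulting model on an image gives the same list of 6 activations (and the same shapes) shown in the printout below.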

To see what is going on, I added some print statements to this template cell from the notebook:

# Assign the content image to be the input of the VGG model.  
# Set a_C to be the hidden layer activation from the layer we have selected
preprocessed_content = tf.Variable(tf.image.convert_image_dtype(content_image, tf.float32))
a_C = vgg_model_outputs(preprocessed_content)
print(f"len a_C {len(a_C)}")
for ii in range(len(a_C)):
    print(f"shape a_C[{ii}] = {a_C[ii].get_shape()}")

Here’s what I get by running that:

len a_C 6
shape a_C[0] = (1, 400, 400, 64)
shape a_C[1] = (1, 200, 200, 128)
shape a_C[2] = (1, 100, 100, 256)
shape a_C[3] = (1, 50, 50, 512)
shape a_C[4] = (1, 25, 25, 512)
shape a_C[5] = (1, 25, 25, 512)
tf.Tensor(0.01646365, shape=(), dtype=float32)

So everything they are doing in train_step with a_C, a_S and a_G is using those same 6 layers including 5 “middle” layers and then the last conv layer before the final pooling layer.

I guess the other subtlety here is that the weights of the first 5 “middle” layers are set to 0.2 apiece, while the final conv layer gets a weight of 1. That means it gets as much influence on the result as the other 5 combined. As they say in the notebook, once we have all this working we can experiment with both which layers we use and the relative weighting. Disclaimer: I have not personally tried any such experimentation.
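To make that weighting concrete, here is a tiny sketch in the spirit of the notebook's weighted-sum loop. The per-layer cost values are made up purely for illustration; the point is just that five weights of 0.2 sum to 1.0, matching the single content-layer weight.

```python
STYLE_LAYERS = [
    ('block1_conv1', 0.2),
    ('block2_conv1', 0.2),
    ('block3_conv1', 0.2),
    ('block4_conv1', 0.2),
    ('block5_conv1', 0.2)]

def weighted_cost(per_layer_costs, layers):
    # Weighted sum of per-layer costs, one weight per selected layer.
    return sum(w * c for (_, w), c in zip(layers, per_layer_costs))

# Made-up, equal per-layer costs, just to compare total influence:
costs = [1.0] * 5
total_style_weight = weighted_cost(costs, STYLE_LAYERS)
content_weight = 1.0
print(total_style_weight, content_weight)  # the two totals match
```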
