C4W1 Lab1: style loss weights correctly implemented?

Hello,

According to the lecture and the Gatys et al. paper, the style loss E_L of layer L is defined as follows:
E_L = 1 / (4 * N_L^2 * M_L^2) * S,
where S is the sum of (elementwise) squared differences of the Gram matrices of the Lth layer of the style image and the generated image (page 106 of the Lecture Notes).
Here, N_L is the number of channels of layer L, and M_L is the number of pixels in each channel (M_L = h*w, h is the height, w is the width).
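For concreteness, the formula can be sketched directly in NumPy (the function and variable names are mine, not the lab's):

```python
import numpy as np

def style_layer_loss(gram_style, gram_generated, n_l, m_l):
    """E_L = 1 / (4 * N_L^2 * M_L^2) * S, where S is the sum of
    element-wise squared differences of two (N_L, N_L) Gram matrices."""
    s = np.sum((gram_style - gram_generated) ** 2)
    return s / (4.0 * n_l**2 * m_l**2)
```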

I cannot figure out how the 1 / (4 * N_L^2 * M_L^2) weight is correctly implemented in the code.

First, there is the get_style_loss() function for computing the style loss of one layer. The comments of this function say that it expects two images of dimension h, w, c (height, width, channels of layer L). I think this is incorrect, since the inputs should rather be two Gram matrices of shape (c, c), i.e., (N_L, N_L) in the notation above. Regardless of this, the function does what it should do, although it performs a reduce_mean rather than the reduce_sum stated in the lecture. Since the Gram matrices have N_L^2 entries, using reduce_mean already accounts for the 1 / N_L^2 coefficient in the style loss (see above).
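The reduce_mean point can be checked with a small NumPy sketch (the lab itself uses TensorFlow, but the arithmetic is the same):

```python
import numpy as np

c = 4                                    # N_L: the Gram matrices are (c, c)
g1 = np.arange(c * c, dtype=float).reshape(c, c)
g2 = np.ones((c, c))

# Taking the mean over a (c, c) tensor divides the sum by c*c = N_L^2,
# which is exactly the 1 / N_L^2 factor in the style loss.
mean_loss = np.mean((g1 - g2) ** 2)
sum_loss_scaled = np.sum((g1 - g2) ** 2) / c**2
```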

When we use the gram_matrix() function to compute the Gram matrix of a layer’s activations, we scale the result by the number of ‘locations’ (i.e., divide by height * width). When we then take the element-wise squared differences of these scaled Gram matrices inside get_style_loss(), we thus account for the 1 / M_L^2 coefficient in the style loss.
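That scaling can be paraphrased like this in NumPy (the lab's version operates on batched TensorFlow tensors; this is a simplified, unbatched sketch of the same idea):

```python
import numpy as np

def gram_matrix(features):
    # features: activations of one layer, shape (h, w, c)
    h, w, c = features.shape
    flat = features.reshape(h * w, c)     # (M_L, N_L)
    gram = flat.T @ flat                  # (N_L, N_L)
    # Dividing by the number of locations M_L = h*w here means that
    # squaring the Gram entries later contributes 1 / M_L^2 to the loss.
    return gram / (h * w)
```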

If I followed correctly, we have thus accounted for 1 / (N_L^2 * M_L^2) of the full 1 / (4 * N_L^2 * M_L^2) weight of the style loss (of layer L). But where is the remaining 1/4 coefficient accounted for? I have not managed to find it in the code, and I think it is actually missing. Or else, what am I missing?
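If that reading is right, the loss computed by the code should come out exactly 4 times larger than the paper's E_L. A quick NumPy check of that claim (shapes and values are made up):

```python
import numpy as np

n_l, m_l = 3, 5
rng = np.random.default_rng(0)
raw_style = rng.standard_normal((n_l, n_l))   # unscaled Gram matrices
raw_gen = rng.standard_normal((n_l, n_l))

# What the code computes: mean of squared differences of the
# M_L-scaled Gram matrices (reduce_mean supplies 1 / N_L^2).
code_loss = np.mean((raw_style / m_l - raw_gen / m_l) ** 2)

# What the paper defines: E_L = S / (4 * N_L^2 * M_L^2).
paper_E_L = np.sum((raw_style - raw_gen) ** 2) / (4 * n_l**2 * m_l**2)
```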

Best,
Istvan


Thanks for your question. If I follow correctly, the get_style_loss() function does take two images as input rather than two Gram matrices, based on how it is used later.

Since different losses are (almost always) combined in a weighted fashion (in this case via the get_style_content_loss() function, which accepts weighting factors for both the style and content losses), I don't think the 1/4 scaling factor, or the choice between reduce_mean and reduce_sum, would make any big difference; ultimately the weights are hyperparameters that need to be “hand-tuned”.

They require two Gram matrices, not two images; please check the parameter flow of the functions to see why:

The function get_style_loss() is called inside the function get_style_content_loss(). Here, the elements of the lists style_targets and style_outputs, which enter as the first two parameters of get_style_content_loss(), are passed directly as the arguments of get_style_loss(). Let’s see what the elements of these lists are and where they come from:

The function get_style_content_loss() is called within calculate_gradients(). Here, the first two arguments of get_style_content_loss(), i.e., the lists in question, are named style_targets and style_features. The list style_features is returned by get_style_image_features() within calculate_gradients(). The other list, style_targets, is passed in as a parameter, but it too is returned by get_style_image_features(), outside of calculate_gradients().

Since get_style_image_features() returns gram matrices, the parameters of get_style_loss() are indeed gram matrices.
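For anyone who finds the chain hard to follow, here is a minimal runnable skeleton of the parameter flow described above. The bodies are simplified NumPy stand-ins, not the lab's actual code (in particular, the layer slices are invented, and calculate_gradients() here only returns the loss rather than gradients):

```python
import numpy as np

def gram_matrix(features):
    # (h, w, c) activations -> (c, c) Gram matrix, scaled by M_L = h*w
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    return (flat.T @ flat) / (h * w)

def get_style_image_features(image):
    # The lab extracts activations from several VGG layers; here we
    # fake two "layers" by slicing the input array.
    fake_layers = [image[:4, :4, :2], image[:8, :8, :3]]
    return [gram_matrix(f) for f in fake_layers]   # list of Gram matrices

def get_style_loss(features, targets):
    # features and targets are (c, c) Gram matrices, NOT images
    return np.mean((features - targets) ** 2)

def get_style_content_loss(style_targets, style_outputs):
    return sum(get_style_loss(f, t)
               for f, t in zip(style_outputs, style_targets))

def calculate_gradients(image, style_targets):
    style_features = get_style_image_features(image)
    return get_style_content_loss(style_targets, style_features)
```

Following the flow bottom-up: whatever get_style_image_features() returns is what get_style_loss() eventually receives.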

If it is hard to follow, an easy way to check this is to insert print(tf.shape(features)); print(tf.shape(targets)) into the body of get_style_loss(). As you run the training loop, it will print shapes such as (1, 64, 64), (1, 128, 128), etc., i.e., the shapes of the Gram matrices in question.

In my opinion, the code is not very easy to read due to the large number of functions calling each other, and the misleading comments in get_style_loss() do not help. It would be nice if those were corrected.

Great analysis, and thanks for stating it so clearly. I agree with you: the 1/4 coefficient is missing, without explanation.