Neural style transfer, programming exercise, compute_content_cost reshape confusion

I am confused about the use of tf.reshape() and tf.transform() in the compute_content_cost() and compute_layer_style_cost() functions, and how the shapes of the matrices in these functions are different.

According to the figure above Exercise 1, compute_content_cost(), the 3D matrix is unrolled into a 2D matrix of shape (n_C, n_H*n_W). However, the additional hints for unrolling state that

To unroll the tensor, you want the shape to change from (m, n_H, n_W, n_C) to (m, n_H*n_W, nC).

Why is that? That’s not what the figure shows. What am I missing here?

In fact, both

tf.reshape(a_C, shape=[m, n_H * n_W, n_C])

and

tf.reshape(a_C, shape=[m, n_C, n_H * n_W])

pass the test. No need for tf.transform(), apparently. More confusion!

Then, in Exercise 3, compute_layer_style_cost(), it is stated that

the desired unrolled matrix shape is (𝑛_𝐶, 𝑛_𝐻∗𝑛_𝑊)

… which is indeed what I would expect. Why is the shape different here? And why do we need to use both tf.reshape() and tf.transform() here, and not in the content cost function?

It’s not transform, it’s transpose, right? That is a very specific mathematical operation. In the first example you show, the point is that the reshape is just the first step. Then you do the transpose. Note that directly reshaping things to the final shape you want is not equivalent and ends up “scrambling” the data. I don’t believe your claim that either form of reshape passes the test. You must not have conducted that experiment correctly: e.g. you changed the code, but didn’t actually click “Shift-Enter” on the changed function cell to get the new code interpreted. So when you called it again you were still running the old code.

To understand why direct reshape without the transpose doesn’t work, here’s a thread from Course 1 about an analogous situation.

Yes, tf.transpose(), of course. Sorry about that!

In the first example you show, the point is that the reshape is just the first step. Then you do the transpose. Note that directly reshaping things to the final shape you want is not equivalent and ends up “scrambling” the data.

First of all – could you please explain this line:

To unroll the tensor, you want the shape to change from (m, n_H, n_W, n_C) to (m, n_H*n_W, nC).

Isn’t this a mistake? Shouldn’t each unrolled matrix have shape (n_C, n_H*n_W)?

So this is what I understand now (also from reading other posts on the forum, among which the one you linked to):

  • my end goal is to get a tensor of shape (m, n_C, n_H*n_W)
  • I need to achieve this somehow using both tf.reshape() and tf.transpose()
  • using only tf.reshape() messes up the data (how are we supposed to know that?)
  • tf.transpose() is needed to keep the m dimensions correct

I’m afraid this is not enough for me to understand how to solve it… Could you please give more hints?

I don’t believe your claim that either form of reshape passes the test. You must not have conducted that experiment correctly: e.g. you changed the code, but didn’t actually click “Shift-Enter” on the changed function cell to get the new code interpreted. So when you called it again you were still running the old code.

I just tried it again, twice, and it’s really what is happening!

Sorry for the confusion – I’ve spent too much time on this already…

I explained that in my first reply: what they show you is only the first step that is required. Then you need the transpose to get to the final desired shape.

The point is that you need to preserve the “channels” dimension in the output, so that it ends up as the first dimension. Directly doing the reshape does not do that. It is the same as in the “flattening” example I gave earlier. Please read that thread again and study carefully how reshape works.

But if you had just followed the hints that they gave you of first doing the reshape to n_H * n_W, n_C then it would have worked, even without really understanding why. But I grant you they could have explained more thoroughly.

Ok, let me run some experiments and see if I can reproduce your results.

Hm. I am completely lost. You know, the instructions and hints here probably make perfect sense if you know what should be done – but if you don’t, they can be extremely confusing. And I don’t want to just follow the hints without understanding what I do.

I fail to see where the thread you link to is relevant. I see that the transpose is necessary there to have the images as column vectors – but that’s not what we want in the programming exercise, is it? (Don’t we want to keep m as the first dimension?)

With what I’ve pieced together, I now have this:

a_C_unrolled = tf.transpose(tf.reshape(a_C, shape=[_, n_H * n_W, n_C]), perm=[0, 2, 1])

As far as I understand, this gives a tensor of shape (m, n_C, n_H*n_W). This passes the test in compute_content_cost(). Can you confirm that this is correct? Then I’ll redact the code again.

But if I do the same in compute_layer_style_cost()

a_S = tf.transpose(tf.reshape(a_S, shape=[_, n_H * n_W, n_C]), perm=[0, 2, 1])

… I get an error

InvalidArgumentError: In[0] mismatch In[1] shape: 16 vs. 3: [1,3,16] [16,3,1] 0 0 [Op:BatchMatMulV2]

I hope these code examples (which I will redact once I get it) give you an idea where I am going wrong.

Ok, sorry, I should have gone back and looked at this notebook before answering. It turns out that for the content cost, the “unrolling” is not even necessary. They actually explicitly tell you that in the instructions. You just need the sum of the squares of the differences between all the elements of the two tensors. So that would explain why it doesn’t matter which way you do it for the content cost. Sigh … I implemented it with no reshape or transpose at all and it passes.

For the Style Cost, it is a lot more complicated. You first need to compute the Gram matrix and doing that requires that you “unroll” the height and width dimensions to get a matrix of shape n_C x n_H * n_W. In order to get that result, we need a reshape followed by the transpose and the reshape needs to preserve the “channels” dimension, which is the last dimension of the inputs here. That is necessary because the Gram matrix is basically a “correlation” matrix for the channels (filters) of the images.
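To make the shapes concrete, here is a minimal NumPy sketch of the unrolling and the Gram matrix described above. It assumes a single sample of shape (1, n_H, n_W, n_C); the names unroll_channels_first and gram are made up for illustration and are not the notebook's functions.

```python
import numpy as np

def unroll_channels_first(a):
    """Unroll (1, n_H, n_W, n_C) activations into an (n_C, n_H*n_W) matrix."""
    n_C = a.shape[-1]
    # The reshape keeps channels as the last axis; the transpose then moves them first.
    return np.transpose(np.reshape(a, (-1, n_C)))

def gram(A):
    """Gram matrix A @ A.T, shape (n_C, n_C): one entry per pair of channels."""
    return A @ A.T

rng = np.random.default_rng(0)
a_S = rng.standard_normal((1, 4, 4, 3))  # hypothetical activations, n_H = n_W = 4, n_C = 3
A = unroll_channels_first(a_S)
G = gram(A)
print(A.shape, G.shape)  # (3, 16) (3, 3)
```

Entry G[i, j] is the sum over all (h, w) positions of channel i times channel j, which only holds if each row of A contains values from a single channel.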

I was pointing to that not as an exact literal duplicate of what we are doing here, but an example showing a) how reshape really works and b) what it means to preserve or not preserve one of the dimensions. In that case it was the 0-th dimension, but here it is the last dimension.

So if you really wanted to understand what is happening here, you could do what I did on that thread: create a “play” tensor with the elements containing their coordinates and then do the “reshape” plus “transpose” and compare the results to what happens if you just “reshape” directly to the final shape you want. The data will be scrambled, but in a different way because of the order of the dimensions.

Also note that we only handle 1 sample at a time in these functions. The tensors are actually 5d because there is also a “layers” dimension that results from sampling multiple internal layer activations. But the first 2 dimensions (layers and samples) are always trivial by the time we call one of the cost functions.

For the style cost line you show, you want the reshape to produce a 2D matrix, and then the transpose works without a perm argument. So drop the samples dimension from the reshape's output shape, drop the perm, and it should work.
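The shapes in that error message are consistent with a plain transpose being applied to a 3D tensor somewhere downstream. A transpose with no permutation reverses all axes, in NumPy as in TF. The sketch below reproduces the shapes in NumPy; that gram_matrix() multiplies its input by a plain transpose of itself is an assumption about the notebook's helper, not something shown in this thread.

```python
import numpy as np

n_H, n_W, n_C = 4, 4, 3
a_S = np.zeros((1, n_H, n_W, n_C))

# Keeping the samples axis gives a 3D tensor ...
batched = np.transpose(np.reshape(a_S, (1, n_H * n_W, n_C)), (0, 2, 1))
print(batched.shape)                # (1, 3, 16)
# ... and a transpose with no permutation reverses all three axes.
print(np.transpose(batched).shape)  # (16, 3, 1)

# Dropping the samples axis gives a 2D matrix whose plain transpose is what A @ A.T needs.
flat = np.transpose(np.reshape(a_S, (n_H * n_W, n_C)))
print(flat.shape, (flat @ flat.T).shape)  # (3, 16) (3, 3)
```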

Thanks for your effort, Paul. I’m getting there: I now managed to get the correct implementation and pass the test, but still have a few questions.

So that would explain why it doesn’t matter which way you do it for the content cost.

OK, this makes sense now. Let me summarise: to compute the cost J_content, you just need the sum of the squares of the differences between all the elements of a_C and a_G – so their internal order really does not matter. This means that, technically, both reshape and transpose are not necessary, but if you do one or both of them anyway, it doesn’t affect the result (x+y+z = x+z+y = z+x+y = …).
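A quick NumPy check of that summary, as a sketch only: it uses random stand-in activations, leaves out any normalization constant, and the helper name sum_sq_diff is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_H, n_W, n_C = 1, 4, 4, 3
a_C = rng.standard_normal((m, n_H, n_W, n_C))
a_G = rng.standard_normal((m, n_H, n_W, n_C))

def sum_sq_diff(x, y):
    # Sum of squared element-wise differences; element order does not matter.
    return np.sum((x - y) ** 2)

no_reshape = sum_sq_diff(a_C, a_G)
reshaped = sum_sq_diff(a_C.reshape(m, n_H * n_W, n_C), a_G.reshape(m, n_H * n_W, n_C))
direct = sum_sq_diff(a_C.reshape(m, n_C, n_H * n_W), a_G.reshape(m, n_C, n_H * n_W))
print(np.allclose(no_reshape, reshaped), np.allclose(no_reshape, direct))  # True True
```

The one condition is that a_C and a_G are rearranged in the same way, so that each element is still subtracted from its own counterpart.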

For the Style Cost, it is a lot more complicated. You first need to compute the Gram matrix and doing that requires that you “unroll” the height and width dimensions to get a matrix of shape (n_C, n_H * n_W).

Ahaaa - the thing that I completely missed (although I understand perfectly well that gram_matrix() takes a 2D matrix as input) is that we should reshape a_S and a_G into 2D matrices. I was retaining the m dimension in both, as I did in compute_content_cost().

Can we “ignore” the m dimension in the reshape because m is always 1 inside compute_layer_style_cost()? I guess using shape=[_ * n_H * n_W, n_C] would also be correct (it passes the test, in any case).

It is still not entirely clear to me why reshape() and transpose() both need to be used. As I understand it now,

  • reshape() is used to preserve the n_C dimension and combine the others
  • transpose() is used to switch the dimensions of the reshaped 2D matrix.

I will have to read your linked thread again a few times, and perhaps run that example.

Yes, the point is that you need to do the reshape in a way that “preserves” the channel dimension and does not scramble the data between channels. But then you need the channels dimension to be the first dimension because the Gram matrix is A \cdot A^T and you want it to be the “correlation” of the filters. It requires a transpose to get the channels dimension as the first dimension. If you reshape directly to that requisite shape, instead of doing the two step process, the data becomes garbage because you mix the h and w data across the channels. If you want to understand that, use the same idea that I showed on that thread about “flattening” images to create a “telltale” tensor and then do the two different algorithms and compare the results. That will make it clear why the direct reshape without the transpose doesn’t work.

Here is the experiment that I described. First the function:

import numpy as np

# routine to generate a telltale 4D array to play with
def testarray(shape):
    (d1,d2,d3,d4) = shape
    A = np.zeros(shape)

    for ii1 in range(d1):
        for ii2 in range(d2):
            for ii3 in range(d3):
                for ii4 in range(d4):
                    A[ii1,ii2,ii3,ii4] = ii1 * 1000 + ii2 * 100 + ii3 * 10 + ii4 

    return A

So you can see that the value of any position in the tensor is the coordinates in order. Meaning that A[1,2,3,4] = 1234.

Now run that as follows:

# test case for Art Transfer exercise C4W4A2
A = testarray((1,4,2,3))
# The correct way
Ashape1 = np.transpose(np.reshape(A,[-1,3]))
# The wrong way
Ashape2 = np.reshape(A, [3,-1])
# Another correct way to do it
Ashape3 = np.reshape(np.transpose(A, [0,3,1,2]), [3,-1])
    
np.set_printoptions(suppress=True)
print("Ashape1.shape = " + str(Ashape1.shape))
print(Ashape1)

print("Ashape2.shape = " + str(Ashape2.shape))
print(Ashape2)

print("Ashape3.shape = " + str(Ashape3.shape))
print(Ashape3)

And here is the result you get:

Ashape1.shape = (3, 8)
[[  0.  10. 100. 110. 200. 210. 300. 310.]
 [  1.  11. 101. 111. 201. 211. 301. 311.]
 [  2.  12. 102. 112. 202. 212. 302. 312.]]
Ashape2.shape = (3, 8)
[[  0.   1.   2.  10.  11.  12. 100. 101.]
 [102. 110. 111. 112. 200. 201. 202. 210.]
 [211. 212. 300. 301. 302. 310. 311. 312.]]
Ashape3.shape = (3, 8)
[[  0.  10. 100. 110. 200. 210. 300. 310.]
 [  1.  11. 101. 111. 201. 211. 301. 311.]
 [  2.  12. 102. 112. 202. 212. 302. 312.]]

So we have 3 as the first dimension in all three cases, but read across each row and look at the last dimension of the values. In Ashape1 and Ashape3, they are consistent across the row: 0 for the first row, 1 for the second row and 2 for the third row.

But look at Ashape2: the rows are all a mix of elements from different channels, right? That’s what I meant by “scrambling” the data.
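The same comparison can be made without reading the printout by eye. This is a self-contained sketch that rebuilds an equivalent telltale array with np.indices instead of loops; the helper name rows_are_single_channel is made up for illustration.

```python
import numpy as np

# Telltale array of shape (1, 4, 2, 3): each value encodes its own coordinates.
i1, i2, i3, i4 = np.indices((1, 4, 2, 3))
A = i1 * 1000 + i2 * 100 + i3 * 10 + i4

def rows_are_single_channel(M):
    # The last digit of each value is its original channel index.
    channels = M % 10
    return bool(np.all(channels == channels[:, :1]))

two_step = np.transpose(np.reshape(A, (-1, 3)))  # reshape, then transpose
direct = np.reshape(A, (3, -1))                  # reshape straight to (3, 8)
print(rows_are_single_channel(two_step))  # True
print(rows_are_single_channel(direct))    # False
```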

Note that I did all the above in numpy, just because I already had the numpy implementation available from the earlier “flatten” thread. But if you rewrite this using TF, you’ll get exactly the same behavior.


Thanks a lot, Paul. I went through your example over the weekend, and I think I get it now.