C4 W1 Assignment 1 - intuition of training data

For Exercise 3 - conv_forward, I’m trying to develop an intuition for why the training data A_prev is structured the way it is if it’s supposed to be training data of images. My understanding of the shape of an image is (pixel_height, pixel_width, 3), with the 3 standing for the three values of the RGB color code.

The shape of A_prev is (m, n_H_prev, n_W_prev, n_C_prev), which in this example is (2,5,7,4), where 2 represents the number of training examples, 5 is the pixel height, 7 is the pixel width, and 4 is the number of channels.

For example, say I have training data with the following shape (1, 4, 4, 3)


However for the A_prev example the vector looks totally different from what I would expect an image vector to look like (see cut and paste of A_prev below)

A_prev :[[[[ 1.62434536e+00 -6.11756414e-01 -5.28171752e-01 -1.07296862e+00]
   [ 8.65407629e-01 -2.30153870e+00  1.74481176e+00 -7.61206901e-01]
   [ 3.19039096e-01 -2.49370375e-01  1.46210794e+00 -2.06014071e+00]
   [-3.22417204e-01 -3.84054355e-01  1.13376944e+00 -1.09989127e+00]
   [-1.72428208e-01 -8.77858418e-01  4.22137467e-02  5.82815214e-01]
   [-1.10061918e+00  1.14472371e+00  9.01590721e-01  5.02494339e-01]
   [ 9.00855949e-01 -6.83727859e-01 -1.22890226e-01 -9.35769434e-01]]

  [[-2.67888080e-01  5.30355467e-01 -6.91660752e-01 -3.96753527e-01]
   [-6.87172700e-01 -8.45205641e-01 -6.71246131e-01 -1.26645989e-02]
   [-1.11731035e+00  2.34415698e-01  1.65980218e+00  7.42044161e-01]
   [-1.91835552e-01 -8.87628964e-01 -7.47158294e-01  1.69245460e+00]
   [ 5.08077548e-02 -6.36995647e-01  1.90915485e-01  2.10025514e+00]
   [ 1.20158952e-01  6.17203110e-01  3.00170320e-01 -3.52249846e-01]
   [-1.14251820e+00 -3.49342722e-01 -2.08894233e-01  5.86623191e-01]]

  [[ 8.38983414e-01  9.31102081e-01  2.85587325e-01  8.85141164e-01]
   [-7.54397941e-01  1.25286816e+00  5.12929820e-01 -2.98092835e-01]
   [ 4.88518147e-01 -7.55717130e-02  1.13162939e+00  1.51981682e+00]
   [ 2.18557541e+00 -1.39649634e+00 -1.44411381e+00 -5.04465863e-01]
   [ 1.60037069e-01  8.76168921e-01  3.15634947e-01 -2.02220122e+00]
   [-3.06204013e-01  8.27974643e-01  2.30094735e-01  7.62011180e-01]
   [-2.22328143e-01 -2.00758069e-01  1.86561391e-01  4.10051647e-01]]

  [[ 1.98299720e-01  1.19008646e-01 -6.70662286e-01  3.77563786e-01]
   [ 1.21821271e-01  1.12948391e+00  1.19891788e+00  1.85156417e-01]
   [-3.75284950e-01 -6.38730407e-01  4.23494354e-01  7.73400683e-02]
   [-3.43853676e-01  4.35968568e-02 -6.20000844e-01  6.98032034e-01]
   [-4.47128565e-01  1.22450770e+00  4.03491642e-01  5.93578523e-01]
   [-1.09491185e+00  1.69382433e-01  7.40556451e-01 -9.53700602e-01]
   [-2.66218506e-01  3.26145467e-02 -1.37311732e+00  3.15159392e-01]]

  [[ 8.46160648e-01 -8.59515941e-01  3.50545979e-01 -1.31228341e+00]
   [-3.86955093e-02 -1.61577235e+00  1.12141771e+00  4.08900538e-01]
   [-2.46169559e-02 -7.75161619e-01  1.27375593e+00  1.96710175e+00]
   [-1.85798186e+00  1.23616403e+00  1.62765075e+00  3.38011697e-01]
   [-1.19926803e+00  8.63345318e-01 -1.80920302e-01 -6.03920628e-01]
   [-1.23005814e+00  5.50537496e-01  7.92806866e-01 -6.23530730e-01]
   [ 5.20576337e-01 -1.14434139e+00  8.01861032e-01  4.65672984e-02]]]

 [[[-1.86569772e-01 -1.01745873e-01  8.68886157e-01  7.50411640e-01]
   [ 5.29465324e-01  1.37701210e-01  7.78211279e-02  6.18380262e-01]
   [ 2.32494559e-01  6.82551407e-01 -3.10116774e-01 -2.43483776e+00]
   [ 1.03882460e+00  2.18697965e+00  4.41364444e-01 -1.00155233e-01]
   [-1.36444744e-01 -1.19054188e-01  1.74094083e-02 -1.12201873e+00]
   [-5.17094458e-01 -9.97026828e-01  2.48799161e-01 -2.96641152e-01]
   [ 4.95211324e-01 -1.74703160e-01  9.86335188e-01  2.13533901e-01]]

  [[ 2.19069973e+00 -1.89636092e+00 -6.46916688e-01  9.01486892e-01]
   [ 2.52832571e+00 -2.48634778e-01  4.36689932e-02 -2.26314243e-01]
   [ 1.33145711e+00 -2.87307863e-01  6.80069840e-01 -3.19801599e-01]
   [-1.27255876e+00  3.13547720e-01  5.03184813e-01  1.29322588e+00]
   [-1.10447026e-01 -6.17362064e-01  5.62761097e-01  2.40737092e-01]
   [ 2.80665077e-01 -7.31127037e-02  1.16033857e+00  3.69492716e-01]
   [ 1.90465871e+00  1.11105670e+00  6.59049796e-01 -1.62743834e+00]]

  [[ 6.02319280e-01  4.20282204e-01  8.10951673e-01  1.04444209e+00]
   [-4.00878192e-01  8.24005618e-01 -5.62305431e-01  1.95487808e+00]
   [-1.33195167e+00 -1.76068856e+00 -1.65072127e+00 -8.90555584e-01]
   [-1.11911540e+00  1.95607890e+00 -3.26499498e-01 -1.34267579e+00]
   [ 1.11438298e+00 -5.86523939e-01 -1.23685338e+00  8.75838928e-01]
   [ 6.23362177e-01 -4.34956683e-01  1.40754000e+00  1.29101580e-01]
   [ 1.61694960e+00  5.02740882e-01  1.55880554e+00  1.09402696e-01]]

  [[-1.21974440e+00  2.44936865e+00 -5.45774168e-01 -1.98837863e-01]
   [-7.00398505e-01 -2.03394449e-01  2.42669441e-01  2.01830179e-01]
   [ 6.61020288e-01  1.79215821e+00 -1.20464572e-01 -1.23312074e+00]
   [-1.18231813e+00 -6.65754518e-01 -1.67419581e+00  8.25029824e-01]
   [-4.98213564e-01 -3.10984978e-01 -1.89148284e-03 -1.39662042e+00]
   [-8.61316361e-01  6.74711526e-01  6.18539131e-01 -4.43171931e-01]
   [ 1.81053491e+00 -1.30572692e+00 -3.44987210e-01 -2.30839743e-01]]

  [[-2.79308500e+00  1.93752881e+00  3.66332015e-01 -1.04458938e+00]
   [ 2.05117344e+00  5.85662000e-01  4.29526140e-01 -6.06998398e-01]
   [ 1.06222724e-01 -1.52568032e+00  7.95026094e-01 -3.74438319e-01]
   [ 1.34048197e-01  1.20205486e+00  2.84748111e-01  2.62467445e-01]
   [ 2.76499305e-01 -7.33271604e-01  8.36004719e-01  1.54335911e+00]
   [ 7.58805660e-01  8.84908814e-01 -8.77281519e-01 -8.67787223e-01]
   [-1.44087602e+00  1.23225307e+00 -2.54179868e-01  1.39984394e+00]]]]

I guess what I’m asking is how does each row, column, or value in the vector map to a picture visually. I’m struggling to understand how A_prev is a vector of a picture when it is structured in a totally different way from what we’ve seen before. Thank you so much.


Hi @PD_Vaillancourt.

I’m not 100% sure I get your point. It’s also been a couple of months since I did this specialization :slight_smile:.

For TensorFlow, the order is by convention as you’ve described: [batch_size, height, width, channels] (note: it can be different for some other libraries).

You can just check the shape of your A_prev to see whether it’s the same as what you’d expect. It has to be 1:1, so every channel has dimension height x width (in other words, the matrix for every channel is height x width).
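As a rough sketch of that shape check: the array below is only a random stand-in with the same shape as the A_prev in the question, not the notebook's actual data, and `one_channel` is just an illustrative name.

```python
import numpy as np

# Stand-in for the assignment's A_prev: random values, same shape as in the question.
A_prev = np.random.randn(2, 5, 7, 4)

# Unpack the four dimensions in channels-last order.
m, n_H_prev, n_W_prev, n_C_prev = A_prev.shape
print(m, n_H_prev, n_W_prev, n_C_prev)

# One channel of one example is a plain height x width matrix.
one_channel = A_prev[0, :, :, 0]
print(one_channel.shape)
```

Slicing out a single channel like this is also the closest thing to the grayscale case, because what remains is one 2D matrix per image.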

If you are confused only because you see real numbers rather than integer pixel values, this is due to normalization or standardization.

Maybe it’s also a good start to do it w/ grayscale image where you have just one channel.



Yes, this is much easier to get a mental grip on if you’re using grayscale images, then each image is a single 2D matrix.


Yes, looking at greyscale images may help a bit, but the real problem here is that the print statement in Python works in a way that may conflict with the most intuitive way to visualize what a given array actually means in terms of its real geometry. It just “unrolls” enough dimensions so that it can then print the last 2 explicitly. So in your case the dimensions of A_prev are (2, 5, 7, 4). Take another look at your printout and notice that what you see is 2 groups, each of which consists of 5 matrices, each of which is 7 x 4, right? That’s what I mean by how it does the unrolling.
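To make the unrolling concrete, here is a small sketch that indexes the same pieces the printout shows. The array is a random stand-in with the shape from the question, and the names `block` and `pixel` are only for illustration.

```python
import numpy as np

# Random stand-in with the same shape as the A_prev in the question.
A_prev = np.random.randn(2, 5, 7, 4)

# Each printed matrix is A_prev[example, row]: 7 pixel positions x 4 channels.
block = A_prev[0, 0]
print(block.shape)

# Each printed line inside a matrix is one pixel: its 4 channel values.
pixel = A_prev[0, 0, 0]
print(pixel.shape)

# The number of printed matrices is examples x image rows.
n_blocks = A_prev.shape[0] * A_prev.shape[1]
print(n_blocks)
```

So reading the printout top to bottom walks through examples first, then image rows, then pixels along the width, with the channel values written across each line.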

So that means that the “channels” dimension is enumerated across each of the rows there, which is completely different than the picture you showed of a 3 channel colored image, where they did the 2 x 2 depiction on the pixel dimensions and then handled the channel dimension by stacking three squares of pixels. That’s a much better way to visualize things, but the print statement has to handle the arbitrary case, so it can’t make any assumptions about the geometric meaning of the dimensions.

One general comment here is that when you look at the input RGB image, there are 3 channels which are the color values. But once you get past the input layer of the network, the channel dimension will typically not be 3 and you can no longer think of the values as “colors”. They are just real numbers which represent some information that the previous layers have derived or distilled from the input values.
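Here is a rough sketch of why the channel count changes after the input layer. It is a simplified stride-1, no-padding convolution without bias or activation, not the assignment's conv_forward, and the function name and the choice of 8 filters are made up for illustration.

```python
import numpy as np

def naive_conv_valid(A_prev, W):
    """Stride-1, no-padding convolution without bias or activation (illustrative only)."""
    m, n_H_prev, n_W_prev, n_C_prev = A_prev.shape
    f, _, _, n_C = W.shape  # W is (f, f, n_C_prev, n_C)
    n_H = n_H_prev - f + 1
    n_W = n_W_prev - f + 1
    Z = np.zeros((m, n_H, n_W, n_C))
    for h in range(n_H):
        for w in range(n_W):
            # Window over all examples: shape (m, f, f, n_C_prev).
            window = A_prev[:, h:h + f, w:w + f, :]
            # For each filter, sum over window height, width, and input channels.
            Z[:, h, w, :] = np.tensordot(window, W, axes=([1, 2, 3], [0, 1, 2]))
    return Z

A_prev = np.random.randn(2, 5, 7, 4)  # 4 input channels
W = np.random.randn(3, 3, 4, 8)       # 8 filters (hypothetical choice)
Z = naive_conv_valid(A_prev, W)
print(Z.shape)
```

The last dimension of the output equals the number of filters, so those values are feature responses rather than colors.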

Here’s a thread which shows some experiments that are a bit easier to interpret to show how this is all working. Note that was triggered by a question about trying to interpret the printed output in the “padding” function section earlier in that same “Conv Step by Step” assignment.


Hello @PD_Vaillancourt

Just for this point, I think this is a one-time exercise - we dig deep in one or a few tensor examples and we call it done once the examples are clear.

A good example is a small tensor with a different size in each dimension, e.g. 2 x 3 x 4 x 5, whose elements are incrementing integers. That way there is no ambiguity when inspecting it, and it becomes easy to see how the numbers are arranged along the different dimensions.

For example:

import numpy as np
images = np.arange(2*3*4*5).reshape((2,3,4,5))

which will give us

array([[[[  0,   1,   2,   3,   4],
         [  5,   6,   7,   8,   9],
         [ 10,  11,  12,  13,  14],
         [ 15,  16,  17,  18,  19]],

        [[ 20,  21,  22,  23,  24],
         [ 25,  26,  27,  28,  29],
         [ 30,  31,  32,  33,  34],
         [ 35,  36,  37,  38,  39]],

        [[ 40,  41,  42,  43,  44],
         [ 45,  46,  47,  48,  49],
         [ 50,  51,  52,  53,  54],
         [ 55,  56,  57,  58,  59]]],

       [[[ 60,  61,  62,  63,  64],
         [ 65,  66,  67,  68,  69],
         [ 70,  71,  72,  73,  74],
         [ 75,  76,  77,  78,  79]],

        [[ 80,  81,  82,  83,  84],
         [ 85,  86,  87,  88,  89],
         [ 90,  91,  92,  93,  94],
         [ 95,  96,  97,  98,  99]],

        [[100, 101, 102, 103, 104],
         [105, 106, 107, 108, 109],
         [110, 111, 112, 113, 114],
         [115, 116, 117, 118, 119]]]])

Then I would start to inspect this thing with indexing. For example, take “one image out” with images[1] and check whether it meets your expectation. Or take “one channel out of an image” with images[0, :, :, 3], then check. With a couple of these checks, hopefully, it will help you develop some intuition.

In real work, whenever I need to inspect a tensor, indexing always helps reduce the dimensions to a point where I and my visual system are more comfortable.
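Those two checks could look like the sketch below, using the same `images` tensor as above. The variable names `second_image` and `channel_3` are just for illustration.

```python
import numpy as np

images = np.arange(2 * 3 * 4 * 5).reshape((2, 3, 4, 5))

# "One image out": drops the batch dimension, leaving (height, width, channels).
second_image = images[1]
print(second_image.shape)  # (3, 4, 5)

# "One channel out of an image": a plain height x width matrix.
channel_3 = images[0, :, :, 3]
print(channel_3)
# [[ 3  8 13 18]
#  [23 28 33 38]
#  [43 48 53 58]]
```

Neighbouring values along the width differ by 5 because the 5 channel values of each pixel sit next to each other in the channels-last layout.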



For example, with the above, you can immediately see where the 2, 3, 4, and 5 are:

This is, of course, not intuitive, because we probably wanted to see something like the below (agree?):

which means the following changes in representation:

I hope you see how the change of representation implements the more intuitive way.

The first representation is called “Channel last”, and the second is called “Channel first”. Both have their fans. TensorFlow defaults to Channel last.

Btw, @PD_Vaillancourt, if you think the Channel-first approach (aka NCHW) is more intuitive to you, then instead of retraining your brain to adapt to the Channel-last (aka NHWC), you can convert the tensor with transpose (NOT reshape).

Transpose swaps the dimensions, and now, in below, we have the more intuitive representation of the same data.

Two posts back, I suggested using images[0, :, :, 3] to take the channel at index 3 out of the 0th image. That result will now show up as one of the ten blocks in the new representation. Have fun :wink:
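A short sketch of the transpose-versus-reshape point, using the same `images` tensor from earlier in the thread. The names `nchw` and `reshaped` are only illustrative.

```python
import numpy as np

images = np.arange(2 * 3 * 4 * 5).reshape((2, 3, 4, 5))  # channels last (NHWC)

# transpose moves the channel axis forward and keeps every pixel with its channel.
nchw = images.transpose(0, 3, 1, 2)
print(nchw.shape)  # (2, 5, 3, 4)

# reshape to the same shape only re-chops the flat sequence of numbers.
reshaped = images.reshape((2, 5, 3, 4))

# Block [0, 3] of the transposed array is exactly channel 3 of image 0.
print(np.array_equal(nchw[0, 3], images[0, :, :, 3]))      # True
print(np.array_equal(reshaped[0, 3], images[0, :, :, 3]))  # False
```

Both results have shape (2, 5, 3, 4), which is why the shape alone cannot tell you whether the conversion was done correctly.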



Really helpful. I did notice how each dimension was represented in the Python print, but it just seemed like an odd arrangement given how we think about a picture’s dimensions in a 2D plane, and since the height and width values of the image were spread across different vectors, it just threw me off. I guess as long as there is a value for each of the dimensions then you have the data you need. I wonder if the way Python unrolls the vectors would impact the result of a math operation with another vector.

This is incredibly helpful. Thank you. I do have one question though: what does each of the values in image.transpose(0,3,1,2) represent?

Hey @PD_Vaillancourt,

They are the mappings. So, 0, 3, 1, 2 are:

0 → 0
3 → 1
1 → 2
2 → 3

With this mapping (counting dimensions from 0), you map the 3rd dimension of the input array to the 1st dimension of the output array, then the 1st to the 2nd, and finally the 2nd to the 3rd. The 0th dimension stays where it is.
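The mapping can be checked directly on the shapes. This is a small sketch using a tensor of the same shape as the earlier example; `axes`, `out`, `inverse`, and `back` are illustrative names.

```python
import numpy as np

images = np.arange(2 * 3 * 4 * 5).reshape((2, 3, 4, 5))
axes = (0, 3, 1, 2)
out = images.transpose(axes)

# Position i in `axes` names the input dimension that becomes output dimension i.
for out_axis, in_axis in enumerate(axes):
    print(f"input dim {in_axis} (size {images.shape[in_axis]}) -> output dim {out_axis}")

print(out.shape)  # (2, 5, 3, 4)

# The inverse permutation restores the original layout.
inverse = tuple(int(i) for i in np.argsort(axes))
back = out.transpose(inverse)
print(inverse, back.shape)  # (0, 2, 3, 1) (2, 3, 4, 5)
```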



Well, the point here is purely about how the “print” function works. It’s got nothing to do with math, right? At least in my part of the conversation.

But if you start doing the transposes as Raymond is talking about, then you have modified the actual arrangement of the data, so that absolutely affects how you specify other operations. You need to keep that in mind and handle everything consistently if you decide to “go there”.
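A small sketch of what "handle everything consistently" means in practice, using the integer tensor from earlier in the thread. The sum over channels is an arbitrary example operation, and the variable names are illustrative.

```python
import numpy as np

images = np.arange(2 * 3 * 4 * 5).reshape((2, 3, 4, 5))  # channels last
nchw = images.transpose(0, 3, 1, 2)                      # channels first

# Summing over channels: axis 3 in the original layout.
sum_nhwc = images.sum(axis=3)
print(sum_nhwc.shape)  # (2, 3, 4)

# The same axis number on the transposed array now sums over width instead.
sum_same_axis = nchw.sum(axis=3)
print(sum_same_axis.shape)  # (2, 5, 3)

# After the transpose, channels live on axis 1, so the axis argument must change too.
sum_nchw = nchw.sum(axis=1)
print(np.array_equal(sum_nhwc, sum_nchw))  # True
```

Printing alone never changes a result; only an actual rearrangement such as transpose changes which axis an operation has to refer to.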

My recommendation is that you stick with the m x h x w x c structure that they are using here to avoid further confusion and having to rewrite everything.