Convolution_model_Step_by_Step_v1 Exercise - 1

I am confused about the first exercise. I ended up getting the answer correct but I do not understand conceptually why it is correct.

The test case is initialized so that x.shape = (4, 3, 3, 2).

x ends up looking like
array([[[[ 1.62434536, -0.61175641],
     [-0.52817175, -1.07296862],
     [ 0.86540763, -2.3015387 ]],

    [[ 1.74481176, -0.7612069 ],
     [ 0.3190391 , -0.24937038],
     [ 1.46210794, -2.06014071]],

    [[-0.3224172 , -0.38405435],
     [ 1.13376944, -1.09989127],
     [-0.17242821, -0.87785842]]],


   [[[ 0.04221375,  0.58281521],
     [-1.10061918,  1.14472371],
     [ 0.90159072,  0.50249434]],

    [[ 0.90085595, -0.68372786],
     [-0.12289023, -0.93576943],
     [-0.26788808,  0.53035547]],

    [[-0.69166075, -0.39675353],
     [-0.6871727 , -0.84520564],
     [-0.67124613, -0.0126646 ]]],


   [[[-1.11731035,  0.2344157 ],
     [ 1.65980218,  0.74204416],
     [-0.19183555, -0.88762896]],

    [[-0.74715829,  1.6924546 ],
     [ 0.05080775, -0.63699565],
     [ 0.19091548,  2.10025514]],

    [[ 0.12015895,  0.61720311],
     [ 0.30017032, -0.35224985],
     [-1.1425182 , -0.34934272]]],


   [[[-0.20889423,  0.58662319],
     [ 0.83898341,  0.93110208],
     [ 0.28558733,  0.88514116]],

    [[-0.75439794,  1.25286816],
     [ 0.51292982, -0.29809284],
     [ 0.48851815, -0.07557171]],

    [[ 1.13162939,  1.51981682],
     [ 2.18557541, -1.39649634],
     [-1.44411381, -0.50446586]]]])

I understand that the 4 represents the number of training examples, and that the two 3’s mean each of the 2 channels is 3x3. But that structure does not seem to be reflected in x.shape. Maybe I am misinterpreting it.

Any guidance to understanding this would be amazing.

Yes, that’s the way to visualize what a 4 x 3 x 3 x 2 tensor represents: think of it as 4 samples. And each sample is a 3 x 3 x 2 “image”. So it has 3 x 3 pixels and each “pixel” has two values.

Now the question is how to map that to the output of that “print” statement. Understanding the square brackets is the key. At the outer level, there are 4 “groups” of arrays. Within each group, there are 3 elements, each of which is 3 x 2. So the printout “peels” the dimensions off from the left end: each group is one sample, each of its 3 elements is one row of pixels, and each innermost pair holds the two channel values of a single pixel. That doesn’t exactly map to our “geometric” interpretation of it as a collection of 4 images.
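If it helps, here is a small sketch of how indexing follows the same left-to-right "peeling" as the brackets. The seed is only an assumption for illustration; any array of shape (4, 3, 3, 2) behaves the same way, and the variable names are made up for this example.

```python
import numpy as np

# Assumption: the seed is a guess for illustration only;
# any (4, 3, 3, 2) array is indexed the same way.
np.random.seed(1)
x = np.random.randn(4, 3, 3, 2)

sample = x[0]             # one outer "group": the first image, shape (3, 3, 2)
row = x[0, 0]             # first row of pixels in that image, shape (3, 2)
pixel = x[0, 0, 0]        # top-left pixel, shape (2,): one value per channel
channel0 = x[0, :, :, 0]  # the 3 x 3 grid of channel-0 values for the first image

print(sample.shape, row.shape, pixel.shape, channel0.shape)
```

So the 3 x 3 grid for a single channel is never printed as one contiguous block; you get it by slicing the last axis, as in `channel0`.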

Why can’t we just store the data as a (4, 2, 3, 3) shape to match the geometric interpretation?

You could do that, but the conventional way to store images is m x h x w x c (samples x height x width x channels). When you’re the boss, you can do it your way, but Prof Ng chooses the other way.

If you do it your way, then it’s like viewing two monocolor images one after the other. But the other way is viewing it as a 2D array of pixels with the “depth” being the color. Which you consider more intuitive is a personal thing, but the conventional way is the way that Prof Ng does it.
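If you want to look at the data in the (4, 2, 3, 3) layout, a minimal sketch is below. It uses a toy array built with np.arange (an assumption for illustration, not the course's test data) so the values are easy to trace. Note that it uses np.transpose to move the channel axis; a plain reshape would keep the memory order and scramble which values belong to which channel.

```python
import numpy as np

# Toy data (hypothetical, for illustration): m x h x w x c = 4 x 3 x 3 x 2
x = np.arange(4 * 3 * 3 * 2).reshape(4, 3, 3, 2)

# Move the channel axis forward: m x c x h x w
x_channels_first = np.transpose(x, (0, 3, 1, 2))
print(x_channels_first.shape)

# Each channel of the first sample is now one contiguous 3 x 3 block in the printout
print(x_channels_first[0])

# The same 3 x 3 grid, obtained by slicing the original layout
same_grid = np.array_equal(x_channels_first[0, 1], x[0, :, :, 1])

# Reshaping instead of transposing does NOT give the same result
reshape_matches = np.array_equal(x.reshape(4, 2, 3, 3), x_channels_first)
print(same_grid, reshape_matches)
```

Both layouts hold exactly the same numbers; the difference is only which axis you read as "channel".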