About 3d ARRAY's shape in numpy and CNN

In NumPy, there is a fixed rule for the order of axes in a 3D array: an array of shape (k,b,c) contains k matrices, each with b rows and c columns.
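For concreteness, here is a minimal sketch of that axis convention (the shape values are just illustrative):

```python
import numpy as np

# A (2, 3, 4) array: 2 matrices, each with 3 rows and 4 columns
a = np.arange(24).reshape(2, 3, 4)
print(a.shape)     # (2, 3, 4)
print(a[0].shape)  # a[0] is one 3x4 matrix -> (3, 4)
```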

But in the assignments and lectures, the 3D data is visualized the same way, yet the order of the axes is different.

To select a 2x2 slice at the upper left corner of a matrix “a_prev” (shape (5,5,3)), you would do: a_slice_prev = a_prev[0:2,0:2,:]

In this hint from the Course 4 Week 1 assignment, the data is pictured as three 5x5 matrices, the same way NumPy would present it, since it says “upper left corner”. However, the order of the axes is totally different from NumPy (the standard NumPy shape would be (3,5,5) and the slicing would be [:,0:2,0:2]).

This difference has given me great difficulty when coding and understanding. I know similar questions have been asked, but I still need help. What is the right way to think about slicing the “upper left corner” of the data? And why does this course not align with NumPy?

There is no “standard order of dimensions” in numpy or any other language: you just have to look at the meaning of the data. When dealing with image data, you have the choice of “channels first” or “channels last” orientation, and here we use “channels last”. So the dimensions of a single image are:

height x width x channels

Of course height and width are the numbers of pixels. The channels are the color values for each pixel, so there will be 1 channel for greyscale images, 3 for RGB images, or 4 for CMYK or RGBA images.
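As a quick sketch of the “channels last” layout (the 5x5 size is just an example):

```python
import numpy as np

# Single images in "channels last" layout: height x width x channels
grey = np.zeros((5, 5, 1))  # greyscale: 1 channel
rgb  = np.zeros((5, 5, 3))  # RGB: 3 channels
rgba = np.zeros((5, 5, 4))  # RGBA (or CMYK): 4 channels
print(rgb.shape)  # (5, 5, 3)
```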

Then if you have multiple images, a first dimension is added for the samples. So if you have m images, the array will be 4D:

m x h x w x c
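A minimal example of stacking images into that 4D layout (the sizes here are hypothetical):

```python
import numpy as np

m, h, w, c = 10, 5, 5, 3                  # hypothetical sizes
images = np.random.rand(m, h, w, c)       # m samples of h x w x c images
print(images.shape)                       # (10, 5, 5, 3)
```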

In Course 1, we needed to convert these 4D image arrays to 2D matrices with dimensions n_x x m, where n_x is the number of features and m is the number of samples. Of course for images we have this relationship:

n_x = h * w * c

You can see a detailed discussion of how the “flattening” operation is done to convert from 4D to 2D on this thread.
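A sketch of that flattening step, assuming hypothetical 64x64 RGB images:

```python
import numpy as np

m, h, w, c = 10, 64, 64, 3
images = np.random.rand(m, h, w, c)

# Flatten each image into one feature column: result is n_x x m
X = images.reshape(m, -1).T
print(X.shape)  # (12288, 10), since n_x = 64 * 64 * 3 = 12288
```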

When we get to ConvNets in Course 4, part of the power of ConvNets comes from the fact that they can handle the original geometric structure of the images: you don’t have to “flatten” them. The networks handle one image at a time, so you select on the first dimension (samples):

oneImage = images[i, :, :, :]

That gets us back to a single image with dimensions h x w x c. As we go through the layers of the convnet, of course, the h, w, and c values will typically change.
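Putting it all together, here is a sketch that answers the original question: with “channels last”, the upper left corner is sliced on the first two axes of a single image, not the last two (sizes are illustrative):

```python
import numpy as np

m, h, w, c = 10, 5, 5, 3
images = np.random.rand(m, h, w, c)

one_image = images[0, :, :, :]     # shape (5, 5, 3): h x w x c
a_slice = one_image[0:2, 0:2, :]   # upper left 2x2 region, all channels
print(one_image.shape)             # (5, 5, 3)
print(a_slice.shape)               # (2, 2, 3)
```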