But I’m confused as to how this is being applied to what seems (to me) a series of pictures of arbitrary size.
Why is it not X_pad = np.pad(X, ((pad, pad), (pad, pad)), 'constant'); ?
It’s a single image (not RGB) … with only two dimensions. Where are the other ‘(0,0)’ terms coming from?
Take another look. The input is a 4D tensor. It tells you that in the “docstring” and other places as well. You could even look at the test code to see the shape of the inputs it is passing you. The height and width dimensions are the 2nd and 3rd of the 4 dimensions. The first dimension is the “samples” dimension. The last dimension is the “channels” dimension, which for an image is the RGB color values.
There is no fixed standard for what the dimensions are in general. If you have a 5D tensor, you’ll need to find the documentation for how it was generated. One such example I can think of would be volumetric medical images like CT scans. There you will have 3 spatial dimensions (height, width and depth) and then the number of channels will be determined by what the outputs of the sensors are. So if you have a batch of such samples, it would end up being a 5D tensor, since each “image” has 4 dimensions and you have multiple such images.
But here in DLS, images are always formatted the same way:
m x h x w x c
Where m is the number of samples (the first dimension), the height h is the second dimension (number of pixels), the width w is the third (number of pixels across) and then c the number of channels (typically 3 for an RGB image, but we sometimes see 4 channels on PNG images) is the last or 4th dimension.
To understand the meaning of the “samples” dimension, the point is we are typically dealing with multiple images in a “batch” of data. So each individual image has 3 dimensions: h, w and c, right? If you want to stack a bunch of images together so that you can deal with them as a batch, you need another dimension. You have two reasonable choices: stack them along a new first dimension or a new last dimension. Prof Ng (and most other people as far as I’ve seen) choose to stack the images so that the first dimension selects the individual samples. So for each value of the first index, you get the ranges of h, w and c for one particular image in the batch. You will see this play out in the “for” loops as you implement conv_forward and pool_forward in the first assignment here in DLS C4 W1.