# Context

I am a geophysicist and have worked with three-dimensional images similar to ultrasounds. I would assume you could use this technique of volume cross-correlation to assist with image interpretation. It seems advantageous to be able to produce multidimensional outputs from a single filter type.

# Question

In the course content, Andrew explains that when working with a volume for convolution/cross-correlation, we make the third dimension the same between the signal matrix and the filter matrix (e.g. 6x6x3 and 3x3x3). The output is a 2D matrix. Is there an issue with making the third dimension of the filter differ from that of the signal (e.g. 6x6x6 and 3x3x3)?

I'm not sure what you mean by "volume cross-correlation", but I can explain a bit more about how convolutional filters work. The point is that at each layer of the network, the inputs and outputs are 4D arrays or "tensors". The order of the dimensions can be considered arbitrary, but Prof Ng has chosen one of the popular arrangements:

samples, height, width, channels

For the purposes of discussing how filters work, we handle each sample individually, so that is a 3D array with dimensions:

h x w x c

where c is called the "channel" dimension. Of course for images as the input, there are 3 "channels" representing the RGB pixel values of the images. If the images are in other representations, sometimes you'll see an Alpha channel included (e.g. with PNG files).
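As a concrete illustration of those layouts (the array sizes below are hypothetical, just to show the dimension ordering):

```python
import numpy as np

# A hypothetical 64 x 64 RGB image: height x width x channels.
image = np.zeros((64, 64, 3))

# A batch of 10 such images adds the leading "samples" dimension,
# giving the full 4D layout: samples, height, width, channels.
batch = np.zeros((10, 64, 64, 3))

print(image.shape)  # (64, 64, 3)
print(batch.shape)  # (10, 64, 64, 3)
```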

If you read through the material in the "Step by Step" assignment in Week 1 of this course (ConvNets), you'll get a more thorough picture of all this, but notice that the shape of the W^{[l]} filters array ("weights") has the following dimensions:

f x f x nC_{prev} x nC

where f is the filter size and nC_{prev} is the number of channels in the input. The next key point is that nC is the number of output channels for the layer in question. The easiest way to think of this is that for each output channel you want to create at a given layer, there is one filter that has the same number of channels as the input. By now, I'm sure you can see where this is going:
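To make those filter shapes concrete first (the sizes here are made up purely for illustration):

```python
import numpy as np

# Hypothetical layer: 3x3 filters, RGB (3-channel) input, 8 output channels.
f, nC_prev, nC = 3, 3, 8
W = np.random.randn(f, f, nC_prev, nC)  # f x f x nC_prev x nC

# One filter per output channel; each filter spans all input channels.
filter_for_channel_0 = W[:, :, :, 0]
print(filter_for_channel_0.shape)  # (3, 3, 3)
```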

You have a 4 level nested loop that steps through the dimensions in this order:

1. Samples
2. Output height
3. Output width
4. Output channels

At each iteration of the innermost loop, you are applying a filter of size f x f x nC_{prev} to a particular position in the input space to compute one element of the output space. So you are converting a 3D input to a 2D output, but you are doing it "per output channel". The result is that you are "stacking" the 2D outputs to create a 3D output. And remember that this is "per sample", so the end result is 4D.
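Here is a rough sketch of that loop structure in NumPy. This is my own simplified version, not the assignment's code: stride 1 by default, no padding, no bias, and the function name is made up.

```python
import numpy as np

def conv_forward(A_prev, W, stride=1):
    """Naive 'valid' cross-correlation over a batch of samples.

    A_prev: input of shape (samples, height, width, nC_prev)
    W:      filters of shape (f, f, nC_prev, nC)
    """
    m, nH_prev, nW_prev, nC_prev = A_prev.shape
    f = W.shape[0]
    nC = W.shape[-1]
    nH = (nH_prev - f) // stride + 1
    nW = (nW_prev - f) // stride + 1
    Z = np.zeros((m, nH, nW, nC))
    for i in range(m):                  # 1. samples
        for h in range(nH):             # 2. output height
            for w in range(nW):         # 3. output width
                for c in range(nC):     # 4. output channels
                    vert = h * stride
                    horiz = w * stride
                    # One f x f x nC_prev slice of the input ...
                    patch = A_prev[i, vert:vert + f, horiz:horiz + f, :]
                    # ... elementwise-multiplied by one filter, then summed,
                    # gives a single scalar in the output.
                    Z[i, h, w, c] = np.sum(patch * W[:, :, :, c])
    return Z
```

For example, a (1, 4, 4, 2) input with a single (3, 3, 2, 1) filter produces a (1, 2, 2, 1) output: 3D input, 2D output per channel, stacked over channels and samples.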

A couple more general points:

The choice of the number of output channels at a given layer of the network is a "hyperparameter", meaning that you simply have to choose it and then decide through testing whether the choice you made is a good one or not.

Also note that the output channels at any given layer are independent of each other, and it does not make sense to think of the values as "colors" any more. They are just real numbers, and the meaning doesn't really become apparent until you put it all together at the very end to come up with an answer. Just think of the values as "signals" in the sense of signal processing. Prof Ng will present some really cool work in Week 4 that shows how to interpret those signals in the hidden layers of a network. See the lecture "What are Deep ConvNets Learning?".

Then finally to what is perhaps the high level point you are actually discussing: if you have inputs that are actually "volumetric" like a CT scan, then the data for each individual sample is 4D. So you end up with a 5D array overall, where the dimensions for a single sample are:

h x w x d x nC

where d is the pixel ("voxel") depth. The first question is what the voxel values represent. In the case of medical images like CTs, I think they are typically a single grayscale value representing density, so in that case nC would be 1. The logical extension of the 3D technique would be to use filters of shape

f x f x f x nC

You could use the same idea by having multiple such filters and â€śstackingâ€ť the outputs as in the 3D case.
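One way to sketch that extension (purely illustrative, and not something implemented in the course: stride 1, no padding, no bias, and the function name is my own):

```python
import numpy as np

def conv3d_forward(A_prev, W):
    """Naive volumetric cross-correlation.

    A_prev: input of shape (samples, h, w, d, nC_prev)
    W:      filters of shape (f, f, f, nC_prev, nC)
    """
    m, nH_prev, nW_prev, nD_prev, nC_prev = A_prev.shape
    f = W.shape[0]
    nC = W.shape[-1]
    nH, nW, nD = nH_prev - f + 1, nW_prev - f + 1, nD_prev - f + 1
    Z = np.zeros((m, nH, nW, nD, nC))
    for i in range(m):
        for h in range(nH):
            for w in range(nW):
                for d in range(nD):          # extra loop over the depth axis
                    for c in range(nC):
                        # An f x f x f x nC_prev cube of the input, times one
                        # filter, summed to a scalar -- stacked per channel.
                        patch = A_prev[i, h:h + f, w:w + f, d:d + f, :]
                        Z[i, h, w, d, c] = np.sum(patch * W[..., c])
    return Z
```

Each filter now collapses the 4D per-sample input to a 3D output, and stacking over nC filters gives a 4D output per sample, exactly mirroring the 2D case one dimension up.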

Prof Ng does briefly discuss convolution over volumes later in the course (Week 4 I think) and points out that you can extend all these techniques from 3D to 4D inputs, but he doesn't go so far as to implement any of that in this course. The obvious way would be to use the same technique of stacking filters, as I sketched above. The other way to add another dimension is to consider successive video frames of a movie instead of single still images. I have no experience in either of those higher dimensional cases, so can't really say anything useful. It would be worth "holding that thought" to see what Prof Ng says in Week 4 and then do some googling or see if he provides any references.