Volume Cross-Correlation Filter Depth

Context

I am a geophysicist and have worked with three-dimensional images similar to ultrasounds. I would assume this technique of volume cross-correlation could be used to assist with image interpretation. It seems advantageous to be able to get multidimensional outputs from a single filter type.

Question

In the course content, Andrew explains that when working with a volume for convolution/cross-correlation, we are to make the third dimension the same value for the signal matrix and the filter matrix (e.g. 6x6x3 and 3x3x3). The output is a 2D matrix. Is there an issue with making the third dimension of the filter differ from that of the signal (e.g. 6x6x6 and 3x3x3)?

I’m not sure what you mean by “volume cross-correlation”, but I can explain a bit more about how Convolutional Filters work. The point is that at each layer of the network, the inputs and outputs are 4D arrays or “tensors”. The definition or order of the dimensions can be considered arbitrary, but Prof Ng has chosen one of the popular arrangements:

samples, height, width, channels

For the purposes of discussing how filters work, we handle each sample individually, so that is a 3D array with dimensions:

h x w x c

where c is called the “channel” dimension. Of course for images as the input, there are 3 “channels” representing the RGB pixel values of the images. If the images are in other representations, sometimes you’ll see an Alpha channel included (e.g. with PNG files).
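
Here is a minimal numpy sketch of those shapes (the array names and sizes are just illustrative, not taken from the course code):

```python
import numpy as np

# A batch of 10 RGB images of size 64 x 64, in the
# (samples, height, width, channels) layout described above:
A = np.zeros((10, 64, 64, 3))

# One sample is then a 3D array of shape h x w x c:
sample = A[0]
print(sample.shape)   # (64, 64, 3)
```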

If you read through the material in the “Step by Step” assignment in Week 1 of this course (ConvNets), you’ll get a more thorough picture of all this, but notice that the shape of the W^{[l]} filters array (“weights”) has the following dimensions:

f x f x nC_{prev} x nC

Where f is the filter size and nC_{prev} is the number of channels in the input. The next key point is that nC is the number of output channels for the layer in question. The easiest way to think of this is that for each output channel you want to create at a given layer, there is one filter that has the same number of channels as the input. By now, I’m sure you can see where this is going:

You have a 4-level nested loop that steps through the dimensions in this order:

  1. Samples
  2. Output height
  3. Output width
  4. Output channels

At each iteration of the innermost loop, you are applying a filter of size f x f x nC_{prev} to a particular position in the input space to compute one element of the output space. So you are converting a 3D input to a 2D output, but you are doing it “per output channel”. The result is that you are “stacking” the 2D outputs to create a 3D output. And remember that this is “per sample”, so the end result is 4D.
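
To make that concrete, here is a minimal unvectorized sketch of that loop in numpy. The function name conv_forward_naive is hypothetical (not the assignment’s actual function), and it assumes “valid” padding with no bias or activation:

```python
import numpy as np

def conv_forward_naive(A_prev, W, stride=1):
    """Cross-correlation via the 4-level loop described above.

    A_prev: (m, h_prev, w_prev, nC_prev)  input activations
    W:      (f, f, nC_prev, nC)  one f x f x nC_prev filter per output channel
    """
    m, h_prev, w_prev, nC_prev = A_prev.shape
    f, _, _, nC = W.shape
    h_out = (h_prev - f) // stride + 1
    w_out = (w_prev - f) // stride + 1
    Z = np.zeros((m, h_out, w_out, nC))

    for i in range(m):                  # 1. samples
        for h in range(h_out):          # 2. output height
            for w in range(w_out):      # 3. output width
                for c in range(nC):     # 4. output channels
                    # one f x f x nC_prev slice of the input, multiplied
                    # elementwise by the filter for output channel c and
                    # summed: one scalar element of the output
                    v, u = h * stride, w * stride
                    a_slice = A_prev[i, v:v + f, u:u + f, :]
                    Z[i, h, w, c] = np.sum(a_slice * W[:, :, :, c])
    return Z

# e.g. one 6x6x3 input with 3x3x3 filters and 8 output channels:
Z = conv_forward_naive(np.random.randn(1, 6, 6, 3), np.random.randn(3, 3, 3, 8))
print(Z.shape)   # (1, 4, 4, 8)
```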

A couple more general points:

The choice of the number of output channels at a given layer of the network is a “hyperparameter”, meaning that you simply have to choose it and then decide through testing whether the choice you made is a good one or not.

Also note that the output channels at any given layer are independent of each other and it does not make sense to think of the values as “colors” any more. They are just real numbers and the meaning doesn’t really become apparent until you put it all together at the very end to come up with an answer. Just think of the values as “signals” in the sense of signal processing. Prof Ng will present some really cool work in Week 4 that shows how to interpret those signals in the hidden layers of a network. See the lecture “What are Deep ConvNets Learning?”.

Then finally to what is perhaps the high-level point you are actually asking about: if you have inputs that are truly “volumetric”, like a CT scan, then I assume each individual sample is a 4D array. So you end up with a 5D array overall, where the dimensions for a single sample are:

h x w x d x nC_{prev}

where d is the pixel (“voxel”) depth. The first question is what the voxel values represent. In the case of medical images like CTs, I think they are typically a single gray scale value representing density, so in that case nC_{prev} would be 1. The logical extension of the 3D technique would be to use filters of shape

f x f x f x nC_{prev}

You could use the same idea by having multiple such filters (one per output channel) and “stacking” the outputs, as in the 3D case.
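
As a shape sanity check, the same naive loop extends directly with one more spatial dimension. This is just a sketch under the assumptions above (a hypothetical function, input of shape (m, h, w, d, nC_prev), weights of shape (f, f, f, nC_prev, nC)):

```python
import numpy as np

def conv3d_forward_naive(A_prev, W, stride=1):
    """Volumetric cross-correlation: the filter also slides along depth.

    A_prev: (m, h, w, d, nC_prev); W: (f, f, f, nC_prev, nC).
    """
    m, h_prev, w_prev, d_prev, nC_prev = A_prev.shape
    f, _, _, _, nC = W.shape
    h_out = (h_prev - f) // stride + 1
    w_out = (w_prev - f) // stride + 1
    d_out = (d_prev - f) // stride + 1
    Z = np.zeros((m, h_out, w_out, d_out, nC))

    for i in range(m):
        for h in range(h_out):
            for w in range(w_out):
                for d in range(d_out):
                    for c in range(nC):   # one scalar output per filter position
                        v, u, t = h * stride, w * stride, d * stride
                        a_slice = A_prev[i, v:v + f, u:u + f, t:t + f, :]
                        Z[i, h, w, d, c] = np.sum(a_slice * W[..., c])
    return Z

# e.g. one single-channel 8x8x8 CT-like volume, 3x3x3 filters, 4 output channels:
Z = conv3d_forward_naive(np.random.randn(1, 8, 8, 8, 1), np.random.randn(3, 3, 3, 1, 4))
print(Z.shape)   # (1, 6, 6, 6, 4)
```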

Prof Ng does briefly discuss convolution over volumes later in the course (Week 4 I think) and points out that you can extend all these techniques from 3D to 4D inputs, but he doesn’t go so far as to implement any of that in this course. The obvious way would be to use the same technique of stacking filters, as I sketched above. The other way to add another dimension is to consider successive video frames of a movie instead of single still images. I have no experience in either of those higher dimensional cases, so can’t really say anything useful. It would be worth “holding that thought” to see what Prof Ng says in Week 4 and then do some googling or see if he provides any references.

If I’ve just completely missed your point here, please follow up with some additional discussion! :nerd_face:


Thank you @paulinpaloalto! I’m still working through the implications of your points as well as of my own original question. I will finish this month’s course and return to this discussion better equipped to talk about it, because I think this is a highly interesting and powerful area of deep learning (in my novice opinion).


Sounds good! It’s definitely a good idea to go through the entire course and hear everything that Prof Ng covers. I’m sure that will address some of the issues you raised above, but there will be plenty more to discuss.