Data Augmentation / Collection

I have two questions:

  • Throughout the Deep Learning Specialization I’ve been wondering about the resolution of the images: the examples consistently use the same input size. Is it valid to train on mixed-resolution data, and do these networks perform equally well on it? Would training on a variety of resolutions actually help the network? If it is valid, do we pad the inputs as needed, or do we need to convert them in some other way so they work in the network? Maybe Data Collection, Hygiene and Cleaning is another field in itself?

  • Instead of using RGB files, what are the implications of feeding the network raw data from the camera sensor? Would the same CNN filters apply? Would there be a need for more Fully Connected layers? Would it even be practical to do that? It would be a lot more input data and possibly a lot of irrelevant noise. But this comes back to the color-shifting augmentation, which seems like a pseudo way of introducing “raw” sensor data.

For any particular network, you have to decide on the exact size and type of the images before you train the network. In the case of the Fully Connected nets we learned about in DLS C1 and C2, the network has to have a well-defined and fixed input vector size. In the case of a ConvNet, you have a little more potential flexibility, but you’ll notice that most of the ConvNet architectures you see in DLS C4 end with a few Fully Connected layers followed by the final classifier output, so those also require a fixed size. In terms of the types of images, they also need to be of one type: color RGB or greyscale or …
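To make that concrete, here is a minimal sketch (NumPy; the sizes are just hypothetical, in the style of the DLS C1 flattened-image classifier) of why a Fully Connected net pins the input size: the first weight matrix is shaped by the flattened input vector, so a different resolution simply doesn’t fit.

```python
# Minimal sketch (NumPy; sizes are hypothetical) of the fixed input
# vector size a Fully Connected net requires.
import numpy as np

n_x = 64 * 64 * 3                       # flattened 64x64 RGB image -> 12288 features
W1 = np.random.randn(25, n_x) * 0.01    # first-layer weights are sized by n_x
b1 = np.zeros((25, 1))

x_ok = np.random.rand(n_x, 1)           # a 64x64x3 image, flattened
z1 = W1 @ x_ok + b1                     # works: (25, 12288) @ (12288, 1)

x_bad = np.random.rand(128 * 128 * 3, 1)    # a 128x128x3 image, flattened
# W1 @ x_bad                            # shape mismatch: (25, 12288) @ (49152, 1)
```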

In the case of a ConvNet, the filters are smaller than the image and you step them across and down the image, so the convolutional layers would produce an output even if the inputs are different sizes. But the intermediate sizes will also be different, and at the point where you finally flatten into an FC layer you are stuck and need a fixed input size.
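A quick sketch of that point (TensorFlow/Keras assumed; the layer sizes are only for illustration): the same convolution runs fine on two different resolutions, but a Flatten + Dense head gets built for one specific flattened length.

```python
# Sketch: a Conv2D tolerates variable spatial sizes, a Flatten -> Dense head does not.
import numpy as np
import tensorflow as tf

conv = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding="same")

small = np.random.rand(1, 64, 64, 3).astype("float32")
large = np.random.rand(1, 128, 128, 3).astype("float32")

print(conv(small).shape)   # (1, 64, 64, 8)   -- the convolution handles this size
print(conv(large).shape)   # (1, 128, 128, 8) -- and this one, with a bigger output

# But the Fully Connected head is built for one specific flattened length:
head = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
print(head(conv(small)).shape)   # (1, 1) -- Dense is now built for 64*64*8 inputs
# head(conv(large))              # would fail: 128*128*8 no longer matches
```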

I don’t think there is a difference between taking inputs from a file or directly from a camera, but you need to train the network on the same type of images that the camera produces (size and type). Cameras produce either color or greyscale images of a certain size, right? But you typically find that networks are trained on “downsampled” images for resource reasons. So you might need a preprocessing layer that uses an image library to downsample or convert the camera outputs into your defined input format.

You could use a similar technique to handle different-sized images: a preprocessing layer to convert them into whatever standard size and type you have trained your network to handle. But then that “image preprocessing” step is a separate thing that you perform once on each batch of images and is not really part of the “network”.
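Something along these lines, as a rough sketch (Pillow assumed; the target size is hypothetical), is the kind of one-off preprocessing meant here: force every image to the one size and type the network was trained on before it ever reaches the “network”.

```python
# Rough sketch (Pillow; TARGET_SIZE is hypothetical): resize and convert
# every image to the fixed size and type the network was trained on.
from PIL import Image
import numpy as np

TARGET_SIZE = (224, 224)   # whatever fixed input size was chosen at training time

def preprocess(path):
    img = Image.open(path).convert("RGB")           # force one image type (RGB)
    img = img.resize(TARGET_SIZE, Image.BILINEAR)   # downsample to the fixed size
    return np.asarray(img, dtype=np.float32) / 255.0  # scale pixels to [0, 1]
```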

Let’s say we are training a network that accepts inputs X of resolution 2400 x 2400. Some training inputs have cats in the background and some have cats in the foreground. Some have small cats and some have big cats (of various breeds) at varied angles. Then we have some lower-resolution images that have been padded up to 2400 x 2400 (see the padding sketch after the question below).

  • How does this variance impact our learning rate and accuracy? Does it mean we will need more edge filters and a bigger training set? I ask because it may be hard to acquire the exact training dataset we require, while the goal is still a more robust cat detector that can identify a cat in the background even when it is not a frontal image of the cat. Will the network learn to identify and detect the cat by its tail, its paws, or any other features, given all of these variations?
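For reference, here is a minimal sketch (NumPy; the sizes are hypothetical) of the kind of zero-padding described above, which places a lower-resolution image onto a fixed 2400 x 2400 canvas.

```python
# Minimal sketch (NumPy; sizes hypothetical): pad a smaller image up to a
# fixed 2400 x 2400 canvas with zeros (a black border).
import numpy as np

TARGET = 2400

def pad_to_target(img):
    """img: (h, w, 3) array with h, w <= TARGET; returns (TARGET, TARGET, 3)."""
    h, w, _ = img.shape
    pad_h, pad_w = TARGET - h, TARGET - w
    return np.pad(
        img,
        ((pad_h // 2, pad_h - pad_h // 2),   # split padding between top/bottom
         (pad_w // 2, pad_w - pad_w // 2),   # split padding between left/right
         (0, 0)),                            # don't pad the color channels
        mode="constant",
    )

small = np.random.rand(1200, 1600, 3)
print(pad_to_target(small).shape)            # (2400, 2400, 3)
```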

Since I have an interest in photography: I understand that raw camera sensor data is not RGB or grayscale. It is typically a mosaic that is then put through a demosaicing algorithm, which determines edges and fills in the missing color values based on assumptions in order to produce the final RGB picture. During this process the original dynamic-range data is not preserved and assumptions are made about the photo, and those assumptions and that data loss are carried into our network. What I am wondering is how convolution would detect edges in raw photo data, where there may be no clear edge as in an RGB photo. In an RGB photo we have 3 clear channels; that is not the case in raw camera data. Along the same lines, can we feed any other (non-image) labelled sensor data into a CNN or plain FC network? Would these function in the same way? Or are CNNs strictly for grayscale/RGB?
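As a side note on the channel part of that question, a convolution layer is not hard-wired to three channels; here is a minimal sketch (TensorFlow/Keras assumed; the shapes are hypothetical) of a Conv2D consuming a single-channel, mosaic-like array the same way it would consume a grayscale image. Whether the learned filters would find useful edges in undemosaiced data is a separate, empirical question.

```python
# Minimal sketch (TensorFlow/Keras; shapes hypothetical): a convolution layer
# applied to a single-channel, mosaic-like input. Conv2D only cares that the
# channel count matches what it was built for, not that the channels are R, G, B.
import numpy as np
import tensorflow as tf

raw_like = np.random.rand(1, 64, 64, 1).astype("float32")   # 1 channel, not RGB
rgb_like = np.random.rand(1, 64, 64, 3).astype("float32")   # 3 channels

conv_raw = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding="same")
conv_rgb = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding="same")

print(conv_raw(raw_like).shape)   # (1, 64, 64, 8)
print(conv_rgb(rgb_like).shape)   # (1, 64, 64, 8)
```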