In the first video of Week 2, “Binary Classification”, the images in the training set were turned into one-dimensional vectors of length 64 x 64 x 3. Doesn’t this lose the geometric relationships between the pixels? For example, pixels (0,0) and (1,0) are next to each other in the image, but positions (0) and (64) are not adjacent in the vector. Doesn’t this cause a loss of information? I can understand that if the actual image sizes are always going to be the same 64 x 64, the training could still capture the relationship between vector positions (0) and (64), but wouldn’t that limit the model to being applied only to images of the same size?
Yes, that’s a really interesting point. In a sense, you are losing the geometric information when you “flatten” or “unroll” the 3D images into vectors, but it turns out that algorithms like Logistic Regression and Neural Networks can still learn to recognize the patterns. Well, maybe the more accurate way to say it is that the geometric information is still there, but it is encoded in a way that is no longer obvious to our eyes, and the algorithm can still figure it out. It does seem a bit surprising, but we can demonstrate that it works. Of course, later in DLS Course 4, we will learn about more powerful architectures for handling images (Convolutional Networks), where the network can accept the full geometric representation of the inputs. But the fully connected nets we are learning about in Courses 1 and 2 can still do it.
Another interesting point here is that there are actually different ways to do the flattening, and any of them will work, provided that you are consistent in handling all the images the same way. For example, you can unroll RGB images in a way that gives you all three color values of each pixel in order, or you can unroll them so that you get all the red values, followed by all the green values, followed by all the blue values. Either works as long as you handle all your data the same way. Here’s a thread that goes into more detail on flattening; if you read it all the way through, you’ll see more about the different ways to do it.
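To make those two orderings concrete, here is a minimal numpy sketch. The single 64 x 64 x 3 image is just randomly generated for illustration; in the assignments the whole training set is flattened at once, but the idea is the same:

```python
import numpy as np

# A single hypothetical 64 x 64 RGB image: shape (height, width, channels)
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Ordering 1: interleaved -- R, G, B of pixel (0,0), then R, G, B of pixel (0,1), ...
interleaved = img.reshape(-1)                  # shape (12288,)

# Ordering 2: planar -- all the red values, then all the green, then all the blue
planar = img.transpose(2, 0, 1).reshape(-1)    # shape (12288,)

# Both orderings contain exactly the same 12288 numbers, just arranged differently.
# Either one works, as long as every image in the dataset is flattened the same way.
print(interleaved.shape, planar.shape)         # (12288,) (12288,)
```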
Thanks Paul. I guess even when the image sizes are different, the process of training can enable the NN to capture the features in a “scale-free” fashion. How it is able to do that is probably one of the mysteries of why NNs can work as well as they do.
A network of this type must be trained on images of a particular size and type (RGB), so you have to decide that up front, before you run the training. Any images you want to feed to the network that are different need to be converted to the format on which the network was trained. You can also start over and retrain with a different size of images, and that will work, provided that the images have enough detail in them to “see” whatever it is you are trying to detect.
Ah, so the size does matter. Can I take what you said to mean that a model trained on images of a specific size cannot be applied to predict on a new image of a different size?
Yes, that’s correct. But you can rescale images. Any reasonable image library includes “resize” as a function.
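As a minimal sketch (not the assignment’s exact code), here’s one way to do that conversion with Pillow and numpy; the file name and the 64 x 64 target size are just assumptions to match the example in this thread:

```python
import numpy as np
from PIL import Image

# Load an arbitrary image and force it to the size/type the model was trained on
img = Image.open("my_cat_photo.jpg").convert("RGB")   # hypothetical file name
img_64 = img.resize((64, 64))                         # match the 64 x 64 training size

# Flatten into a column vector and scale to [0, 1], consistent with the training data
x = np.array(img_64).reshape(-1, 1) / 255.0           # shape (12288, 1)
print(x.shape)
```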
Thanks very much Paul.
As an example of how to preprocess images to the correct size to use with a trained neural network, you’ll see one way to do that in the “Test with your own image” section at the end of the Week 2 Logistic Regression assignment and then again in Week 4 Assignment 2. Please “stay tuned” for that!
At a higher level, note that any camera these days produces very high resolution images, but that means a lot of data. If you have a training set with thousands or tens of thousands of images, using the full native size of the images will be incredibly costly in terms of both storage and the compute cost of training. So one of the early decisions you need to make when designing a new system to solve an image recognition problem is what size to downsize your images to, so that you get performance good enough to solve your problem at a reasonable compute and storage cost. In the example here, you can see they chose to use 64 x 64 images, which are pretty small. When you get to that section of the assignment, it is interesting to display your original image and then what it looks like after being downsized to 64 x 64: you lose quite a bit of resolution, but it’s still clear whether there is a cat in the picture or not. But if, for example, you were looking at chest x-ray images and trying to decide if there is any sign of disease, you would probably need to preserve more of the resolution of the original image in order to get good results. Lots of decisions need to be made to get a successful solution to a problem!
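To put rough numbers on that trade-off, here is a quick back-of-the-envelope calculation; the 12-megapixel camera resolution and the 10,000-image training set are just assumed for illustration:

```python
# Rough comparison: native camera resolution vs. downsized 64 x 64 images
native_pixels = 4000 * 3000              # a hypothetical 12-megapixel photo
small_pixels = 64 * 64

native_features = native_pixels * 3      # RGB -> 36,000,000 input values per image
small_features = small_pixels * 3        # RGB -> 12,288 input values per image

m = 10_000                               # assumed number of training images
print(native_features // small_features)                 # ~2929x more data per image
print(m * small_features * 8 / 1e9, "GB as float64 for the downsized training set")
```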