Storage space of an "unrolled" image vs a normal image

I recently finished the first course, and for fun I wanted to try to build my own NN that does the same thing as the cat one, but from scratch. Something I’m curious about is whether the unrolled image takes up less, more, or the same space as the normal image. My intuition is telling me that they would be the same, because each pixel value has to be stored whether it’s in a vector or in an image, but I’m not sure. Any help would be great!

Your intuition is correct: before and after flattening, you have the same number of pixels, so the memory consumed is the same. There is some overhead or “metadata” that Python needs in order to keep track of any variable or “object”. The size of that metadata might be slightly different between the two forms, but we’re only talking at most a few tens of bytes, which is dwarfed by the actual size of all the pixels.
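
If you want to convince yourself of that, here’s a quick sanity check with NumPy (the shapes here are just made up for illustration; the assignment’s flatten uses the same reshape trick):

import numpy as np

# a fake "dataset" of 100 RGB images, each 64 x 64 pixels
images = np.zeros((100, 64, 64, 3), dtype=np.uint8)
# unroll: each column becomes one flattened image, shape (12288, 100)
flattened = images.reshape(images.shape[0], -1).T

print(images.nbytes)     # 1228800 bytes
print(flattened.nbytes)  # 1228800 bytes -- exactly the same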

But there are some subtleties to mention. If you take a look in either the Week 2 or Week 4 assignment where we mess with the image data, you can insert statements like this after we’ve done all the preprocessing of the images (here are the commands for Week 2):

print(f"train_set_x_orig.dtype = {train_set_x_orig.dtype}")
print(f"train_set_x_flatten.dtype = {train_set_x_flatten.dtype}")
print(f"train_set_x.dtype = {train_set_x.dtype}")

Here’s what I get when I run that cell:

train_set_x_orig.dtype = uint8
train_set_x_flatten.dtype = uint8
train_set_x.dtype = float64

So you can see that in the original 4D array form, the pixel values are unsigned 8 bit integers, and they stay that type in the flattened images. Each image therefore requires 64 x 64 x 3 = 12288 bytes, since each pixel value takes exactly 1 byte.

But then we rescale the 8 bit values to be floating point numbers between 0 and 1 by dividing by 255., and that gives us 64 bit floating point values. Those take … wait for it … 64 bits to store each individual pixel value. So that’s 8 bytes per pixel instead of 1, meaning each normalized image now occupies 98304 bytes.
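
To see those numbers directly, you can run something like this (a minimal sketch, with a zero-filled array standing in for a real image):

import numpy as np

img = np.zeros((64, 64, 3), dtype=np.uint8)
print(img.nbytes)           # 12288 bytes: 1 byte per pixel value

normalized = img / 255.     # the division promotes uint8 to float64
print(normalized.dtype)     # float64
print(normalized.nbytes)    # 98304 bytes: 8 bytes per pixel value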

Common image formats like JPEG define pixel values as 8 bit unsigned integers with values from 0 to 255 (some formats, like PNG and TIFF, can also support higher bit depths). Then you also have the added complexity that file representations typically include some form of compression, so the on-disk size is usually much smaller than the raw pixel count would suggest. That is another level of complexity, but the question you are asking is about “in memory” representations.
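
If you want a feel for how much compression buys you, here’s a rough experiment (this assumes you have Pillow installed; I’m using a smooth gradient image, which compresses very well, so real photos will land somewhere in between):

import io

import numpy as np
from PIL import Image

# a 256 x 256 grayscale gradient: every row is identical, so it compresses well
pixels = np.tile(np.arange(256, dtype=np.uint8), (256, 1))
img = Image.fromarray(pixels)

buf = io.BytesIO()
img.save(buf, format="PNG")
print(pixels.nbytes)          # 65536 bytes in memory
print(len(buf.getvalue()))    # far fewer bytes in the PNG file representation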

Now in exchange for my effort in attempting to answer the question, you can tell me why this is a big deal to you. You have to deal with the images in flattened form and normalize them in any case, so you really don’t have much choice here. The only fundamental “hyperparameter” level choice you have is the resolution of the images you want to use for your system. Needless to say, any camera these days is going to produce images that have a LOT more than 64 x 64 pixels. So we end up downsampling images that are going to be used as inputs to a NN like this. The question is how far we can go before we lose so much resolution that the network can no longer make the distinctions we need it to make. There is no a priori right answer for any given case: you’ll have to figure that out as part of your system design process.
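
If you do end up experimenting with your own photos, the downsampling itself is a one-liner with Pillow (the filename here is hypothetical):

from PIL import Image

img = Image.open("my_cat_photo.jpg")   # hypothetical file, e.g. 4032 x 3024 from a phone
small = img.resize((64, 64))           # downsample to the network's input resolution
small.save("my_cat_photo_64.jpg")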


Hey Paul,

I appreciate your effort as always; your answer makes sense. I ask because I wanted to conserve as much space as possible on my computer, and I was thinking through ways of doing that. One thought that came to mind was that maybe if I unrolled the photos in batches, and only held them on my computer long enough to unroll them, I could save space and fit as many photos as possible.

I wasn’t sure if it would help, but seeing as this has been the most consistent source for answers, I thought I would check my intuition. I hope this forum is ok for curiosity questions that aren’t directly related to the curriculum. If these off-topic questions are out of place, please let me know and I’ll be sure to keep the rest on topic.

It’s fine to ask questions about things that are beyond the scope of the course materials. Of course there is no guarantee you’ll get an answer, but it’s worth a try. If it’s a question about how to actually apply the techniques here to “real world” problems, it’s pretty likely you’ll find people interested to discuss that.

In the case here, you’re (I think) talking about optimizing the storage of your input images in a “static” way, meaning sitting on disk. You have lots of options, but you might want to consider taking advantage of the compression features that I described above. A fair amount of effort was invested in defining formats like JPEG precisely because images are large and disk space is always a limitation. Then any time you want to use them as input, you have to “unpack” them, as the load_dataset function does in Week 2. Of course there the file format is h5.

The other high level point here is that if you really want to apply these techniques, nobody really writes their own code from scratch in Python. Everyone uses a “framework” like TensorFlow or PyTorch. You’ll start learning about TF in Course 2 of DLS. Once you convert to that, you can just keep your collection of images as JPEG files in a directory hierarchy and use functions like image_dataset_from_directory to load them into memory.
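
For reference, once you get to TF, loading such a directory looks roughly like this (the directory name and class layout here are just placeholders):

import tensorflow as tf

# assumes a layout like cat_data/cat/*.jpg and cat_data/not_cat/*.jpg
dataset = tf.keras.utils.image_dataset_from_directory(
    "cat_data",            # hypothetical directory of JPEGs
    image_size=(64, 64),   # images are resized on the fly as they load
    batch_size=32,
)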

Sweet! Maybe I’ll just postpone this little project then and do the second course first. Thanks!

Yes, there is a huge amount of interesting and relevant material in the rest of the DLS courses. You’ll want to do at least C2 and C4 to get a good start on TF and how to build working systems in general. Onward! :nerd_face: