C1_M3_Lab_data_management - Error when I iterate through FlowerDataset object

Hello,

I’ve downloaded both the notebooks and the data to my local computer to dig into the data and the PyTorch functions.
While trying to replicate the notebooks, I ran into an error that I don’t understand.

Here’s my code (~same as the lab notebook).


Imports

import os
import tarfile
import requests
import scipy
from tqdm import tqdm
from PIL import Image
from torch.utils.data import Dataset

Data download

data_dir = "./flower_data"
img_folder_path = os.path.join(data_dir, 'jpg')
labels_file_path = os.path.join(data_dir, 'imagelabels.mat')
tgz_path = os.path.join(data_dir, '102flowers.tgz')
os.makedirs(data_dir, exist_ok=True)

image_url = "https://www.robots.ox.ac.uk/~vgg/data/flowers/102/102flowers.tgz"
labels_url = "https://www.robots.ox.ac.uk/~vgg/data/flowers/102/imagelabels.mat"

response = requests.get(image_url, stream=True)
total_size = int(response.headers.get("content-length", 0))
with open(tgz_path, "wb") as file:
    for data in tqdm(
        response.iter_content(chunk_size=1024),
        total=total_size // 1024,
    ):
        file.write(data)

with tarfile.open(tgz_path, "r:gz") as tar:
    tar.extractall(data_dir)

response = requests.get(labels_url)
with open(labels_file_path, 'wb') as file:
    file.write(response.content)

FlowerDataset definition

class FlowerDataset(Dataset): 
    def __init__(self, root_dir, tranform=None):
        self.root_dir = root_dir
        self.tranform = tranform
        self.img_dir = os.path.join(self.root_dir, "jpg")
        self.labels = self.load_and_correct_labels()

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        image = self.retrieve_image(idx)
        if self.tranform:
            image = self.tranform(image)
        label = self.labels[idx]
        return image, label

    def retrieve_image(self, idx):
        img_name = f"image_{idx+1:05d}.jpg"
        img_path = os.path.join(self.img_dir, img_name)
        with Image.open(img_path) as img:
            image = img.convert("RGB")
        return image

    def load_and_correct_labels(self):
        self.labels_mat = scipy.io.loadmat(
            os.path.join(self.root_dir, 'imagelabels.mat')
        )
        labels = self.labels_mat['labels'][0] - 1
        return labels

Loop through the dataset

dataset = FlowerDataset(data_dir)
for _ in dataset:
    pass

When I do this I get the following error:

[Errno 2] No such file or directory: './flower_data\jpg\image_08190.jpg'

I’ve checked the length of the dataset object, and it is 8189:

print(len(dataset))
# 8189

→ But it seems that the iteration goes from 0 to 8189 instead of 0 to 8188.

I tried a few AIs to troubleshoot, but they didn’t seem to find where the error comes from. They ended up giving me this workaround:

def __getitem__(self, idx):
    if idx >= len(self.labels):
        raise IndexError(f"Index {idx} out of range for dataset of length {len(self.labels)}")
    image = self.retrieve_image(idx)
    if self.tranform:
        image = self.tranform(image)
    label = self.labels[idx]
    return image, label

Is there something I’m missing?

After looking into it a bit more, this is standard Python behavior.

When we implement __getitem__, Python’s iteration protocol requires us to raise IndexError when the index is out of bounds; that’s how Python knows when to stop iterating.

Without raising IndexError, Python doesn’t know when to stop and keeps calling __getitem__ with incrementing indices, causing the code to break (either with an error or with unexpected behavior).

This is how all Python sequences work (lists, tuples, etc.). If we want this style of iteration to work, raising IndexError is mandatory.

I thought that Python would use __len__ to define the boundaries of the loop, but that is not the case.

I’m not convinced we have a complete answer at this point. :nerd_face: If this is standard behavior, then why do you need to code it differently when you run locally?

(Assuming that the dataset contains multiple images).

Perhaps what this error means is that the dataset is missing one specific image, and it has nothing to do with the length of the dataset or how Python indexing works.

This part below is not how it’s done in the lab:

dataset = FlowerDataset(data_dir)
for image, label in dataset:
    pass

In the lab, you actually use a DataLoader to iterate through the dataset by batches:

dataset = FlowerDataset(data_dir)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

for batch_images, batch_labels in dataloader:
    pass

This works fine in the lab, because DataLoader draws indices from a sampler built on range(len(dataset)), so __getitem__ is never called with an out-of-range index.

The reason I iterated directly over the dataset (without a DataLoader) was to compute the mean and standard deviation of the images for normalization.
In the lab, this step isn’t needed because the mean and standard deviation are already pre-computed. I just did it manually as an extra exercise.
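For what it’s worth, that extra exercise can be done safely by indexing with range(len(dataset)), so __getitem__ is never called out of bounds. Here is a pure-Python sketch of the running computation; the fake dataset and flat pixel lists are just stand-ins (with the real FlowerDataset you would accumulate per-channel tensor sums the same way):

```python
import math

def running_mean_std(dataset):
    """Single-pass mean/std over all pixel values of a dataset.

    `dataset` is anything indexable whose items are (pixels, label)
    pairs, where `pixels` is a flat sequence of floats. Indexing with
    range(len(dataset)) stays in bounds, so this works even when
    __getitem__ does not raise IndexError.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for idx in range(len(dataset)):   # explicit bounds taken from __len__
        pixels, _ = dataset[idx]
        for p in pixels:
            n += 1
            total += p
            total_sq += p * p
    mean = total / n
    std = math.sqrt(total_sq / n - mean * mean)
    return mean, std

# tiny fake dataset: two "images" with known pixel values
fake = [([0.0, 1.0], 0), ([2.0, 3.0], 1)]
print(running_mean_std(fake))  # mean = 1.5, std = sqrt(1.25) ≈ 1.118
```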

So, if someone runs the lab as provided, they won’t encounter any errors.

I’ve checked, and (surprisingly) this is indeed how Python iteration works.
For example, if you run the following code:

class MyList:
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        print(idx)
        return self.data[idx]

my_list = MyList([1, 2, 3])

for i in my_list:
    pass

You’ll get this output:

0
1
2
3

Since the object’s length is 3, you might expect the indices to go only from 0 to 2.
However, when a class defines __getitem__ but no __iter__, Python falls back to the legacy sequence protocol: it calls __getitem__ repeatedly with increasing indices (0, 1, 2, …) until it encounters an IndexError.
When idx = 3 raises an IndexError, Python silently converts it into StopIteration to signal the end of the iteration, so the loop stops gracefully.
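You can watch this conversion happen directly with iter() and next(). The toy class below is just an illustration (not part of the lab); since it defines __getitem__ but no __iter__, iter() falls back to the same legacy sequence protocol:

```python
class Seq:
    def __getitem__(self, idx):
        if idx >= 3:
            raise IndexError(idx)
        return idx * 10

it = iter(Seq())  # no __iter__ defined: iter() builds a legacy sequence iterator
print(next(it), next(it), next(it))  # 0 10 20
try:
    next(it)  # __getitem__(3) raises IndexError ...
except StopIteration:
    print("IndexError was converted into StopIteration")  # ... which next() converts
```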

If you raise another type of exception instead, such as FileNotFoundError, it won’t be caught and the code will crash:

class MyList:
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        if idx >= len(self.data):
            raise FileNotFoundError(f"Index {idx} out of bounds")
        return self.data[idx]

my_list = MyList([1, 2, 3])

for i in my_list:
    print(i)

This crashes with a FileNotFoundError, because that exception is not the special signal Python uses to end iteration.

:white_check_mark: Two possible solutions:

  • Raise an IndexError in the __getitem__ method (as expected by Python’s iteration protocol).
  • Implement a custom __iter__ method, which is the standard and more explicit way to make an object iterable.
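A minimal sketch of the second option, reusing the MyList toy class from above. A generator-based __iter__ takes its bounds from __len__, so no IndexError signalling is needed:

```python
class MyList:
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

    def __iter__(self):
        # __iter__ takes precedence over the __getitem__ fallback,
        # and range(len(self)) keeps every index in bounds
        for idx in range(len(self)):
            yield self.data[idx]

my_list = MyList([1, 2, 3])
print(list(my_list))  # [1, 2, 3]
```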

There are 8189 images in the dataset.

$ ls -lh *.jpg | wc -l                
8189

The first image is image_00001.jpg and the last is image_08189.jpg; there are no incorrectly named or missing images.