Bug Report C1_M3_Lab_data_management.ipynb: Double Transformation and TypeError in SubsetWithTransform Implementation

Hi everyone,

I would like to report a subtle bug in the notebook C1_M3_Lab_data_management.ipynb:

Description

In this notebook, there is a logic conflict between the base FlowerDataset and the SubsetWithTransform wrapper. The base dataset is initialized with a transform pipeline that includes transforms.ToTensor(). When SubsetWithTransform is later applied to the split subsets, it attempts to apply a second transformation pipeline (the augmentation) to an object that has already been converted into a Tensor.

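This results in a TypeError whenever the second pipeline contains a step that cannot accept a torch.Tensor. The most direct trigger is a second ToTensor() in the wrapper's pipeline: it expects a PIL Image or ndarray, and it is the call that raises the message shown in the traceback below. Note that in recent torchvision releases RandomHorizontalFlip does accept tensors, so the flip on its own would not crash.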

Steps to Reproduce

  1. Initialize FlowerDataset with a base transform that includes ToTensor().

  2. Split the dataset using random_split.

  3. Wrap the training subset in SubsetWithTransform using an augmentation pipeline (e.g., RandomHorizontalFlip followed by ToTensor()).

  4. Access an element: train_dataset[0].

Note that this also applies to the validation and test datasets when they are wrapped the same way.

Technical Analysis

The issue lies in the nested call stack of the __getitem__ methods:

  1. SubsetWithTransform.__getitem__ calls self.subset[idx].

  2. This triggers the base FlowerDataset.__getitem__, which applies its internal self.transform.

  3. If the base transform includes ToTensor(), the image is returned as a Tensor.

  4. SubsetWithTransform then attempts to apply augmentation_transform to this Tensor, causing the crash.
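
To make the call chain above concrete without needing the notebook, here is a minimal sketch in plain Python. Everything in it is a stand-in I made up for illustration: FakeImage, FakeTensor, to_tensor, BaseDataset and Wrapper only imitate the roles of a PIL image, a torch.Tensor, ToTensor(), FlowerDataset and SubsetWithTransform. The split step is left out because it does not change the order of the calls.

```python
# Stand-ins only: none of these classes come from the notebook or torchvision.

class FakeImage:
    """Plays the role of a raw PIL image."""


class FakeTensor:
    """Plays the role of a torch.Tensor."""


def to_tensor(pic):
    """Imitates ToTensor(): converts an image, rejects anything else."""
    if not isinstance(pic, FakeImage):
        raise TypeError(f"pic should be PIL Image or ndarray. Got {type(pic)}")
    return FakeTensor()


class BaseDataset:
    """Hypothetical stand-in for FlowerDataset."""

    def __init__(self, size, transform=None):
        self.size = size
        self.transform = transform

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        image, label = FakeImage(), idx % 2
        if self.transform is not None:
            image = self.transform(image)  # steps 2-3: base transform runs here
        return image, label


class Wrapper:
    """Hypothetical stand-in for SubsetWithTransform."""

    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, idx):
        image, label = self.subset[idx]    # step 1: delegates to the base dataset
        if self.transform is not None:
            image = self.transform(image)  # step 4: second pipeline runs here
        return image, label


def try_first_item(dataset):
    """Return "ok" or the error text instead of letting the exception escape."""
    try:
        dataset[0]
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


# Both levels convert to a tensor, so the second conversion receives a FakeTensor.
double = Wrapper(BaseDataset(4, transform=to_tensor), transform=to_tensor)
result = try_first_item(double)
print(result)
```

With the base transform left as None, the same wrapper returns "ok", which is the behaviour the suggested fix relies on.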

Error Traceback

TypeError: pic should be PIL Image or ndarray. Got <class 'torch.Tensor'>

Suggested Fix

The base FlowerDataset should be re-initialized with transform=None when it is intended to be split and wrapped by SubsetWithTransform. This ensures that the raw PIL image is passed up the chain, allowing the wrapper to handle all transformations in a single pass.

Recommended Code Change:

# Initialize base dataset without transforms to avoid double-processing
dataset_raw = FlowerDataset(path_dataset, transform=None)

# ... perform split ...

# Apply specific transforms only at the subset level.
# random_split returns Subset objects, so the wrapper receives a subset, not a list of indices.
train_dataset = SubsetWithTransform(train_subset, transform=augmentation_transform)
val_dataset = SubsetWithTransform(val_subset, transform=base_transform)
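
For anyone who wants to check the single-pass behaviour outside the notebook, here is a rough sketch of the same idea in plain Python. All names are hypothetical stand-ins (RawDataset for FlowerDataset with transform=None, IndexSubset for what random_split returns, Wrapper for SubsetWithTransform), the split uses fixed indices instead of a random one, and the `applied` list exists only to record which transforms ran.

```python
# Stand-ins only: none of these classes come from the notebook or torchvision.

class FakeImage:
    """Plays the role of a raw PIL image."""


class FakeTensor:
    """Plays the role of a torch.Tensor."""


applied = []  # records every transform call, in order


def flip(pic):
    """Imitates an augmentation step that keeps the image type."""
    applied.append("flip")
    return pic


def to_tensor(pic):
    """Imitates ToTensor(): converts an image, rejects anything else."""
    if not isinstance(pic, FakeImage):
        raise TypeError(f"pic should be PIL Image or ndarray. Got {type(pic)}")
    applied.append("to_tensor")
    return FakeTensor()


def compose(*steps):
    """Chain transforms left to right, similar in spirit to transforms.Compose."""
    def run(pic):
        for step in steps:
            pic = step(pic)
        return pic
    return run


class RawDataset:
    """Base dataset created with transform=None: always returns raw images."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        return FakeImage(), idx % 2


class IndexSubset:
    """View onto a dataset restricted to the given indices."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        return self.dataset[self.indices[idx]]


class Wrapper:
    """Applies the only transform pipeline in the chain."""

    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, idx):
        image, label = self.subset[idx]
        if self.transform is not None:
            image = self.transform(image)
        return image, label


raw = RawDataset(6)                           # no transform at the base level
train_subset = IndexSubset(raw, [0, 1, 2, 3])  # fixed split for illustration
val_subset = IndexSubset(raw, [4, 5])

train_dataset = Wrapper(train_subset, transform=compose(flip, to_tensor))
val_dataset = Wrapper(val_subset, transform=to_tensor)

image, label = train_dataset[0]
print(type(image).__name__, applied)
```

Each sample now passes through exactly one pipeline, so the training subset sees the augmentation plus the conversion and the validation subset sees only the conversion.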

Note that the bug does not surface during normal execution of the notebook, since the subsets are not accessed after being defined. Nevertheless, I would encourage explicitly re-initializing FlowerDataset with transform=None: it removes the double transformation and helps students build a clearer mental model of the data pipeline, in which each image is transformed exactly once.

I hope you find this helpful in further improving the resources of this great course.

Best regards,
Carl

Thanks for reporting this @Carl_Schmidt.

Yes, we are aware of this. This should be fixed in the coming days.

Thanks for the quick reply @Mubsi!

Just saw the other post about this issue.
My bad for creating a duplicate.

No worries.