Initial evaluation of the CalTech101 dataset

ai_curious · March 15, 2023, 1:01pm

When searching for papers or datasets related to object detection, you are likely to come across references to the Caltech 101 dataset. It is potentially interesting because it has both class labels and object bounding boxes. I downloaded and untarred it and took a quick look. Here are my initial observations.

Pro:

About 9,000 image files with class and bounding box labels. That is more than the toy datasets out there that have only a few hundred images, but still manageable for quick experiments.
101 classes (they call them categories). Again, more than many toy datasets, but still easier to consume than, say, ImageNet’s 1000.
Image files are small, roughly 300x300, so they take little space on disk or in RAM, load quickly, and run through the CNN quickly.
Adequate fidelity for most images I have looked at, though see below.

Con:

The labels (they call them annotations) are stored as MATLAB files. You’ll have to write a subroutine, or use one from a library, to open and read them. I used scipy.io.
The images are not all the same size, which is an extra headache. Resizing to a standard input shape for the CNN isn’t a big deal, but then you have to adjust the bounding boxes to match.

./CalTech101/Images/beaver_0045.jpg is: (300, 175)
./CalTech101/Images/beaver_0046.jpg is: (300, 203)
./CalTech101/Images/bonsai_0001.jpg is: (280, 300)
./CalTech101/Images/bonsai_0002.jpg is: (265, 300)
./CalTech101/Images/crayfish_0055.jpg is: (300, 147)
./CalTech101/Images/flamingo_0002.jpg is: (80, 300)
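
For the bounding box adjustment, here is a minimal sketch of the arithmetic. It assumes the sizes printed above are (width, height) and that a box is stored as (x_min, y_min, x_max, y_max) in pixels. The function name `rescale_box` and the example box values are made up for illustration; they are not from the dataset.

```python
def rescale_box(box, orig_size, new_size):
    """Scale an (x_min, y_min, x_max, y_max) box to a resized image.

    orig_size and new_size are (width, height) tuples. This assumes a plain
    stretch to new_size with no padding or cropping.
    """
    orig_w, orig_h = orig_size
    new_w, new_h = new_size
    sx = new_w / orig_w  # horizontal scale factor
    sy = new_h / orig_h  # vertical scale factor
    x_min, y_min, x_max, y_max = box
    return (x_min * sx, y_min * sy, x_max * sx, y_max * sy)


# Hypothetical box on a 300x175 image that gets resized to 224x224
print(rescale_box((30, 35, 270, 140), (300, 175), (224, 224)))
```

Note that a plain stretch changes the aspect ratio, so x and y get different scale factors. If you pad to keep the aspect ratio instead, the boxes need an offset as well as a scale.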

Most of the images have reasonable foreground and background complexity, but some have a completely white background, and some are cartoons rather than photos.
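
Going back to the MATLAB annotation files in the list above, this is roughly how reading one with scipy.io can look. Treat it as a sketch, not a reference: it assumes each .mat file has a `box_coord` entry holding a 1x4 array ordered [y_min, y_max, x_min, x_max]. I have not confirmed that ordering here, so draw a few boxes on the images to check before trusting it. The helper names and the example values are made up.

```python
def box_from_annotation(mat):
    """Return (x_min, y_min, x_max, y_max) from a loaded annotation dict.

    ASSUMPTION: mat["box_coord"] is a 1x4 array ordered
    [y_min, y_max, x_min, x_max]. Verify this against your own files.
    """
    y_min, y_max, x_min, x_max = (int(v) for v in mat["box_coord"][0])
    return (x_min, y_min, x_max, y_max)


def load_box(mat_path):
    """Read one annotation .mat file and return its bounding box."""
    # Imported here so box_from_annotation can be used without SciPy installed
    from scipy.io import loadmat
    return box_from_annotation(loadmat(mat_path))


# Stand-in for what loadmat might return, with made-up values
fake_mat = {"box_coord": [[35, 140, 30, 270]]}
print(box_from_annotation(fake_mat))
```

To use it on the real data, pass `load_box` the path to one of the annotation .mat files. Printing the keys of the dict that `loadmat` returns is a quick way to see what is actually stored in there.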

So far I have only found single object images, so maybe not so interesting for YOLO.

Overall, it seems adequate for small experiments and self-enablement projects only. If you’re hoping to train your own autonomous vehicle, keep looking.

https://data.caltech.edu/records/mzrjq-6wc02

NOTE: TensorFlow offers a pre-built TF Dataset version, but it is only for classification; it doesn’t include the bounding boxes.

TMosh · April 14, 2023, 5:48pm

Thanks for posting this.
