[Project Share] Scalable Label Statistics + Stratified Sampling from Pascal VOC using AWS Serverless

Yagmur_Gulec · September 8, 2025, 4:10pm

Hi everyone!

I recently built a serverless pipeline to analyze the Pascal VOC object detection dataset and generate balanced train/validation splits using stratified sampling. It’s designed to be scalable and modular, using:

AWS Lambda + SQS + DynamoDB for async job submission and tracking
Athena + Glue to run SQL over Parquet-converted VOC data
Balanced downsampling (e.g., max 500 instances per label) to reduce class imbalance
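
To make the per-label cap concrete, here is a minimal local sketch in plain Python. It is not the pipeline's actual code (the pipeline described above runs SQL in Athena); the record layout (`image_id`, `label`), the function names, and the toy data are assumptions made for illustration. It treats each annotated object as one instance and keeps at most `max_per_label` randomly chosen instances per label.

```python
import random
from collections import Counter, defaultdict


def cap_per_label(instances, max_per_label=500, seed=0):
    """Keep at most `max_per_label` randomly chosen instances of each label."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst["label"]].append(inst)

    kept = []
    for label in sorted(by_label):  # sorted for a deterministic order
        group = by_label[label]
        if len(group) > max_per_label:
            group = rng.sample(group, max_per_label)
        kept.extend(group)  # rare labels below the cap are kept in full
    return kept


def label_counts(instances):
    """Count instances per label (the before/after statistics)."""
    return Counter(inst["label"] for inst in instances)


# Hypothetical toy data: one frequent label and one rare label.
instances = (
    [{"image_id": f"img{i}", "label": "person"} for i in range(1200)]
    + [{"image_id": f"img{i}", "label": "sheep"} for i in range(80)]
)

sampled = cap_per_label(instances, max_per_label=500)
print("before:", dict(label_counts(instances)))
print("after: ", dict(label_counts(sampled)))
```

Note that a cap only trims frequent labels; it does not add examples for rare ones, so the result is less imbalanced rather than perfectly balanced.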

The output includes:

Label distribution stats (before/after sampling)
Stratified train/validation splits
Job tracking in DynamoDB
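
For readers who want to see what a stratified split does, the following is a rough sketch in plain Python, assuming each record carries a single `label` to stratify on. That is a simplification: Pascal VOC images often contain several object classes, and the post does not say how the pipeline handles that case. The function name, the `val_fraction` parameter, and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(records, val_fraction=0.2, seed=0):
    """Split records so each label keeps roughly the same train/val ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)

    train, val = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        # With two or more records, keep at least one on each side
        # so a rare label does not vanish from train or validation.
        if len(group) >= 2:
            n_val = min(max(n_val, 1), len(group) - 1)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


# Hypothetical toy data: one common label and one rare label.
records = (
    [{"image_id": f"car{i}", "label": "car"} for i in range(100)]
    + [{"image_id": f"boat{i}", "label": "boat"} for i in range(10)]
)

train, val = stratified_split(records, val_fraction=0.2)
print("train:", dict(Counter(r["label"] for r in train)))
print("val:  ", dict(Counter(r["label"] for r in val)))
```

Splitting within each label group, rather than shuffling the whole dataset once, is what keeps the rare label present in both splits.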

I’d love your feedback on:

How have you handled class imbalance in object detection tasks?
Did you use techniques like oversampling, focal loss, or class weighting?
Any thoughts on better strategies for downsampling or maintaining rare classes?
Tips for scaling this to larger datasets (e.g., COCO, OpenImages)?
Ideas for visual dashboards to explore the splits interactively?

Thanks in advance for your thoughts and experiences. I’d really love to hear how others approach this challenge!
