Hi everyone!
I recently built a serverless pipeline to analyze the Pascal VOC object detection dataset and generate balanced train/validation splits using stratified sampling. It’s designed to be scalable and modular, using:
- AWS Lambda + SQS + DynamoDB for async job submission and tracking
- Athena + Glue to run SQL over Parquet-converted VOC data
- Balanced downsampling (e.g., max 500 instances per label) to avoid class imbalance
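To make the downsampling idea concrete, here's a minimal sketch in plain Python. It is an illustration only, not the pipeline's actual code: the function name `cap_per_label` and the record shape (dicts with a `label` key) are assumptions made up for this example.

```python
import random
from collections import defaultdict

def cap_per_label(instances, max_per_label=500, seed=0):
    """Randomly keep at most `max_per_label` instances for each label.

    `instances` is assumed to be a list of dicts with a "label" key
    (a hypothetical record shape chosen for this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group instances by their label.
    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst["label"]].append(inst)

    kept = []
    for label in sorted(by_label):  # sorted for a deterministic order
        group = by_label[label]
        if len(group) > max_per_label:
            # Frequent class: sample down to the cap.
            group = rng.sample(group, max_per_label)
        # Rare class: keep every instance untouched.
        kept.extend(group)
    return kept
```

Note that this caps object instances, not images. Because a single VOC image can contain several classes, applying a cap at the image level needs extra care.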
The output includes:
- Label distribution stats (before/after sampling)
- Stratified train/validation splits
- Job tracking in DynamoDB
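As a rough sketch of what the first two outputs mean, the snippet below counts instances per label and splits each label's instances separately so both splits keep similar class proportions. The names `label_distribution` and `stratified_split`, the 20% validation fraction, and the record shape are assumptions for illustration, not the pipeline's real code.

```python
import random
from collections import Counter, defaultdict

def label_distribution(instances):
    """Count how many instances each label has."""
    return Counter(inst["label"] for inst in instances)

def stratified_split(instances, val_fraction=0.2, seed=0):
    """Split per label so train and validation share class proportions.

    `instances` is assumed to be a list of dicts with a "label" key
    (a hypothetical record shape chosen for this sketch).
    """
    rng = random.Random(seed)  # fixed seed for reproducible splits

    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst["label"]].append(inst)

    train, val = [], []
    for label in sorted(by_label):
        group = list(by_label[label])  # copy before shuffling
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Calling `label_distribution` on the data before and after sampling, and again on each split, gives a quick check that the class proportions are preserved.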
I’d love your feedback on:
- How have you handled class imbalance in object detection tasks?
- Did you use techniques like oversampling, focal loss, or class weighting?
- Any thoughts on better strategies for downsampling or maintaining rare classes?
- Tips for scaling this to larger datasets (e.g., COCO, OpenImages)?
- Ideas for visual dashboards to explore the splits interactively?
Thanks in advance for your thoughts and experiences. I’d really love to hear how others approach this challenge!