[Project Share] Scalable Label Statistics + Stratified Sampling from Pascal VOC using AWS Serverless

Hi everyone!

I recently built a serverless pipeline to analyze the Pascal VOC object detection dataset and generate balanced train/validation splits using stratified sampling. It’s designed to be scalable and modular, using:

  • AWS Lambda + SQS + DynamoDB for async job submission and tracking

  • Athena + Glue to run SQL over Parquet-converted VOC data

  • Balanced downsampling (e.g., a cap of 500 instances per label) to reduce class imbalance
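To make the downsampling step concrete, here is a minimal in-memory sketch of the per-label cap. It is a simplified illustration, not the pipeline's actual code: the function name `cap_per_label`, the record fields `image_id` and `label`, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def cap_per_label(instances, max_per_label=500, seed=0):
    """Keep at most `max_per_label` randomly chosen instances for each label.

    `instances` is a list of dicts with a "label" key (hypothetical schema).
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible

    # Group instances by their label.
    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst["label"]].append(inst)

    kept = []
    for label in sorted(by_label):  # sorted for a deterministic order
        group = by_label[label]
        if len(group) > max_per_label:
            # Frequent label: draw a random subset without replacement.
            group = rng.sample(group, max_per_label)
        # Rare label: keep every instance untouched.
        kept.extend(group)
    return kept
```

Labels below the cap are kept in full, so rare classes are never reduced further; only the frequent ones are trimmed.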

The output includes:

  • Label distribution stats (before/after sampling)

  • Stratified train/validation splits

  • Job tracking in DynamoDB
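For anyone who wants to reproduce the first two outputs locally, the sketch below shows one way to compute label counts and a per-label stratified split. It is an illustration under assumptions, not the pipeline's code: `stratified_split`, `label_counts`, the `label` field, and the 20% validation fraction are hypothetical choices for the example.

```python
import random
from collections import Counter, defaultdict


def label_counts(instances):
    """Return the number of instances per label."""
    return Counter(inst["label"] for inst in instances)


def stratified_split(instances, val_fraction=0.2, seed=0):
    """Split instances into train/val so each label keeps roughly the same ratio."""
    rng = random.Random(seed)

    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst["label"]].append(inst)

    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label][:]  # copy so the input is not mutated
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        if len(group) >= 2:
            # Keep at least one instance of the label on each side.
            n_val = min(max(n_val, 1), len(group) - 1)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Calling `label_counts` on the data before sampling, after sampling, and on each split gives the before/after distribution stats. One caveat: this splits individual instances, so in object detection the same image could land in both train and validation; grouping by image before splitting would avoid that leakage.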

I’d love your feedback on:

  • How have you handled class imbalance in object detection tasks?

  • Did you use techniques like oversampling, focal loss, or class weighting?

  • Any thoughts on better strategies for downsampling or maintaining rare classes?

  • Tips for scaling this to larger datasets (e.g., COCO, OpenImages)?

  • Ideas for visual dashboards to explore the splits interactively?

Thanks in advance for your thoughts and experiences. I’d really love to hear how others approach this challenge!
