Short answers: Yes. Augmentation.
You probably want to incorporate augmentation anyway, because you want as many detector locations as possible exposed to a diversity of classes. That is, if you always train on images with a cavity only right in the center of the image, the other grid cell locations will never be trained to predict that class. Also, to generalize well, you are likely to need to account for somewhat different views. Augmenting with spatial repositioning is likely to help with generalization, and while you're doing it, you can boost the underrepresented classes by generating more augmented copies of the images that contain them.
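One rough way to decide how much extra augmentation each image gets is sketched below. It is only an illustration under assumptions of mine: labels are available as a plain dict from image id to the class names of its boxes, the function name `augmentation_copies` is made up, and the "filling" class in the usage example is hypothetical. The heuristic (scale by the rarest class in each image) is one option among several, and it also adds more boxes of any common class that shares those images.

```python
import math
from collections import Counter

def augmentation_copies(image_labels):
    """Suggest how many augmented copies to generate per image.

    image_labels: dict mapping image id -> list of class names,
                  one entry per bounding box in that image (assumed format).
    Returns a dict mapping image id -> number of extra augmented copies.
    """
    # Count box instances per class across the whole dataset.
    counts = Counter(c for labels in image_labels.values() for c in labels)
    if not counts:
        return {image_id: 0 for image_id in image_labels}

    target = max(counts.values())  # size of the most common class
    copies = {}
    for image_id, labels in image_labels.items():
        if not labels:
            copies[image_id] = 0  # no boxes, nothing to rebalance
            continue
        # Scale by the rarest class present in this image.
        rarest = min(counts[c] for c in labels)
        copies[image_id] = math.ceil(target / rarest) - 1
    return copies

# Usage: "cavity" is rare, so images containing it get extra copies.
labels = {
    "a": ["cavity"],
    "b": ["filling", "filling"],
    "c": ["filling", "filling"],
    "d": ["filling", "cavity"],
}
plan = augmentation_copies(labels)
```

Each extra copy should go through a different random transform (shift, flip, scale) so the rare class lands in different grid cells rather than being duplicated in place.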
Be forewarned that augmentation for algorithms that use bounding boxes introduces additional complexity, since you have to keep the bounding box coordinates in sync with changes to the image.
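To make the bookkeeping concrete, here is a minimal sketch of a horizontal flip that updates the boxes along with the pixels. It assumes boxes are stored as `(x_min, y_min, x_max, y_max)` in pixel coordinates and that the image is a plain list of rows; the function name `hflip_with_boxes` is my own. If your labels use the normalized center/width/height form, the arithmetic changes accordingly.

```python
def hflip_with_boxes(image, boxes):
    """Flip an image left-to-right and mirror its bounding boxes.

    image: list of rows, each row a list of pixel values (assumed layout).
    boxes: list of (x_min, y_min, x_max, y_max) in pixel coordinates,
           where x runs from 0 at the left edge to the image width.
    """
    width = len(image[0])
    # Reverse each row to mirror the pixels horizontally.
    flipped_image = [row[::-1] for row in image]
    # A point at x maps to width - x, so the old right edge becomes
    # the new left edge and vice versa; y coordinates are unchanged.
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for (x_min, y_min, x_max, y_max) in boxes
    ]
    return flipped_image, flipped_boxes

# Usage: a box hugging the left edge ends up hugging the right edge.
image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
boxes = [(0, 0, 1, 2)]
new_image, new_boxes = hflip_with_boxes(image, boxes)
```

Shifts, crops, and scaling need the same treatment, plus clipping or dropping boxes that fall partly or fully outside the new image. Augmentation libraries that support bounding boxes can handle this for you, but it is worth drawing a few augmented samples with their boxes to confirm they still line up.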
It would be interesting to compare accuracy with and without balancing. Let us know how it goes.