@BryanEL here are some thoughts from my experience training YOLO v2-level code on a custom dataset, which have been memorialized in other threads:
Here’s a TL;DR:
- Training YOLO from scratch is completely non-trivial. It takes a lot of data that you know well and a lot of computation.
- You need to do augmentation to ensure objects appear in all locations of the training set; otherwise you train predominantly on finding objects in the center of an image, and most grid cell locations get no training at all (see the translation sketch below).
- Grid cell and anchor box sizes need to be tuned to the data, especially if you expect many objects that are small relative to the image size.
- Speaking of anchor boxes, it is very important that you not use the default anchor box count and shapes if those don’t match your dataset, as it will lead to poor training outcomes. You will need to do the k-means analysis yourself to find the optimal number and shapes of anchors for your data (see the clustering sketch below).
- Every object detection dataset I looked at used either JSON or XML, and every one had a complex record that included both the bounding box and the class. If they are not paired in the training data, you can’t do detection (meaning localization + classification) until they are.
- The dataset I worked with, mentioned in and linked from those threads, used JSON. Each record had lots of things I didn’t care about, so I had to write my own crawler to extract the bits I wanted (see the parsing sketch below).
- At least circa v2, which is what this course was based on, the YOLO creators broke the pipeline apart and used different architectures for localization training and classification training, because trying to learn all the parameters at once was too hard. My experience suggested I should do the same, but I never completed the refactoring.
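On the augmentation point: here’s a minimal sketch of the kind of random translation I mean, which shifts an image and its boxes together so objects end up assigned to different grid cells across epochs. This is an illustration with assumed conventions (pixel-space `[x_min, y_min, x_max, y_max]` boxes, numpy arrays), not the pipeline I actually used:

```python
import numpy as np

def random_translate(image, boxes, max_shift=0.3):
    """Randomly shift an image and its boxes so that, over many epochs,
    objects land in different grid cells rather than clustering centrally.

    image: HxWxC uint8 array
    boxes: Nx4 float array of [x_min, y_min, x_max, y_max] in pixels
    """
    h, w = image.shape[:2]
    dx = int(np.random.uniform(-max_shift, max_shift) * w)
    dy = int(np.random.uniform(-max_shift, max_shift) * h)

    # Shift by copying into a mean-filled canvas of the same size.
    shifted = np.full_like(image, int(image.mean()))
    src_x = slice(max(0, -dx), min(w, w - dx))
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    shifted[dst_y, dst_x] = image[src_y, src_x]

    # Apply the same shift to the boxes and clip to the image bounds.
    shifted_boxes = boxes + np.array([dx, dy, dx, dy], dtype=boxes.dtype)
    shifted_boxes[:, [0, 2]] = shifted_boxes[:, [0, 2]].clip(0, w)
    shifted_boxes[:, [1, 3]] = shifted_boxes[:, [1, 3]].clip(0, h)

    # Drop boxes that were shifted entirely out of frame.
    keep = (shifted_boxes[:, 2] > shifted_boxes[:, 0]) & \
           (shifted_boxes[:, 3] > shifted_boxes[:, 1])
    return shifted, shifted_boxes[keep]
```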
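On the anchor analysis: the YOLO v2 paper clusters the (width, height) of the training boxes with k-means, using 1 − IoU as the distance so big and small boxes are treated fairly. A sketch of that idea (my own simplified implementation, not the authors’ code):

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between box shapes and anchor shapes, ignoring position
    (both treated as if centered at the origin).
    boxes: Nx2 of (w, h); anchors: Kx2 of (w, h). Returns NxK."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(box_wh, k, iters=100, seed=0):
    """Cluster box (w, h) pairs using 1 - IoU as the distance,
    in the spirit of YOLO v2's dimension-cluster analysis."""
    rng = np.random.default_rng(seed)
    anchors = box_wh[rng.choice(len(box_wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the anchor it overlaps best.
        assign = np.argmax(iou_wh(box_wh, anchors), axis=1)
        # Recompute each anchor as the median shape of its cluster;
        # keep the old anchor if a cluster went empty.
        new = np.array([np.median(box_wh[assign == j], axis=0)
                        if np.any(assign == j) else anchors[j]
                        for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors
```

In practice you’d run this for several values of k (say 3 through 9), plot the mean best-IoU for each, and pick the elbow: that gives you both the count and the shapes for your dataset.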
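And on the JSON crawler: the shape of the extraction is simple even if the records are messy. Here’s a minimal sketch assuming COCO-style field names (`annotations`, `categories`, `bbox`); the dataset I used had a different schema, so treat the names as placeholders:

```python
import json

def extract_annotations(path):
    """Walk a COCO-style JSON file and keep only the fields the
    detector needs: image id, class label, and bounding box."""
    with open(path) as f:
        data = json.load(f)

    # Map category ids to human-readable names once, up front.
    categories = {c["id"]: c["name"] for c in data["categories"]}

    records = []
    for ann in data["annotations"]:
        x, y, w, h = ann["bbox"]          # COCO stores [x_min, y_min, width, height]
        records.append({
            "image_id": ann["image_id"],
            "class": categories[ann["category_id"]],
            "box": [x, y, x + w, y + h],  # convert to [x_min, y_min, x_max, y_max]
        })
    return records
```

The key point is the last bullet above: each output record pairs the box with the class, which is what detection training requires.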
If you do go down this path, please post your experience.