I have a different question, about the manual work needed to build a training set for YOLO.
As Andrew pointed out, you first train your ConvNet on a set of closely cropped images of the different object classes. So I think it is safe to assume that, for the images that contain an object, bx = by = 0.5 always holds, and also bh = bw = 1. That means you don’t have to manually measure the location of the center of the object (because it is always in the center!).
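To make the assumption concrete, here is a minimal sketch of how such labels could be generated automatically. It assumes the label layout [pc, bx, by, bh, bw, c1, ..., cn] and that the object fills the whole crop, so the height and width are both 1. The function names `cropped_label` and `background_label` are just made up for illustration.

```python
def cropped_label(class_index, num_classes):
    """Label [pc, bx, by, bh, bw, c1..cn] for a closely cropped object image."""
    one_hot = [0.0] * num_classes
    one_hot[class_index] = 1.0
    # Assumption: the object fills the crop, so its center is (0.5, 0.5)
    # and its height and width are both 1 (relative to the image).
    return [1.0, 0.5, 0.5, 1.0, 1.0] + one_hot


def background_label(num_classes):
    """Label for an image with no object: pc = 0, the rest are don't-cares."""
    # Zeros are only placeholders here; the loss would ignore these entries.
    return [0.0] * (5 + num_classes)


print(cropped_label(1, 3))
print(background_label(3))
```

If this assumption holds, no manual box measurement is needed for the cropped images, only the class label.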
Later, at test time, you use that trained ConvNet in a sliding window algorithm (or its convolutional implementation).
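A rough sketch of the plain (non-convolutional) sliding window step I have in mind is below. The image is a nested list of pixel values, and `classify` is a hypothetical stand-in for the trained ConvNet that returns the probability that a crop contains an object. The window size, stride, and threshold are also just assumptions.

```python
def sliding_windows(image_h, image_w, window, stride):
    """Yield the (top, left) corner of every square window that fits."""
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            yield top, left


def detect(image, classify, window, stride, threshold=0.5):
    """Run the classifier on each window and keep the ones above threshold."""
    hits = []
    h, w = len(image), len(image[0])
    for top, left in sliding_windows(h, w, window, stride):
        # Cut out the window x window crop starting at (top, left).
        crop = [row[left:left + window] for row in image[top:top + window]]
        if classify(crop) >= threshold:
            hits.append((top, left))
    return hits
```

For example, with a toy 4x4 image and a toy classifier that returns the mean pixel value of the crop, `detect(image, toy_classify, window=2, stride=2)` returns the corners of the windows whose mean is at least 0.5.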
Is that correct? Thanks to all