Hi,
I guess this goes a little outside the scope of support for the programming exercises, but I just want to make sure I understand how to train a ConvNet using bounding boxes or semantic segmentation.
1. For semantic segmentation, do I have to label every pixel, i.e. every position, in the ground truth vector y for every picture in my training set? Wouldn't that take forever?
2. For bounding boxes, do I, for every object of interest in every picture, have to draw a bounding box around the object, calculate the IoU with all pre-defined anchor boxes, and then place the [p, x, y, w, h, "class-vector"] entry in y at the position belonging to the chosen anchor?
Are these labeling tasks automated in any way these days?
Yes, there are services out there that provide annotation and labeling tools (with human guidance of course); just google "image annotation" or "semantic labeling". I came across https://roboflow.com/ the other day. I'm not advertising them, just mentioning it.
The same goes for point 2: there are services out there. It is also possible to automate the rest programmatically, i.e. by writing Python code that performs parts of the pre-processing, modeling, and post-processing, but you have to be aware of how these pieces fit into the bigger picture.
@gent.spah has it right: the short answers are all 'yes'.
Yes you have to do it, yes it can take a lot of time, yes some of it can be automated, yes many available datasets have done this for you already.
Also note, regarding the anchor boxes part, that anchor boxes are not part of all object detection CNNs. Grid cells and anchor boxes are specific to the way YOLO handles multiple objects per image. But once you have the bounding box labels, it is straightforward to implement the algebra that derives the correct grid cell and anchor box index. There are threads in this forum, as well as on the interweb, that discuss doing just that.
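To make that algebra a bit more concrete, here is a minimal sketch of a YOLO-style label encoding in Python. The grid size, anchor shapes, class count, and function names are all illustrative assumptions (not the course's or any library's official implementation); real projects typically use anchors derived from k-means on the training-set box shapes and whatever grid size the network actually outputs.

```python
import numpy as np

# Illustrative values only (assumptions for this sketch)
GRID_S = 19                      # 19x19 output grid
NUM_CLASSES = 3
ANCHORS = np.array([             # (width, height), normalized to image size
    [0.08, 0.12],
    [0.25, 0.20],
    [0.60, 0.55],
])

def shape_iou(box_wh, anchor_wh):
    """IoU of two boxes compared by shape only (both treated as centered at the origin)."""
    inter = min(box_wh[0], anchor_wh[0]) * min(box_wh[1], anchor_wh[1])
    union = box_wh[0] * box_wh[1] + anchor_wh[0] * anchor_wh[1] - inter
    return inter / union

def encode_label(boxes):
    """boxes: list of (x, y, w, h, class_id), coordinates normalized to [0, 1].
    Returns y with shape (S, S, num_anchors, 5 + num_classes)."""
    y = np.zeros((GRID_S, GRID_S, len(ANCHORS), 5 + NUM_CLASSES))
    for x, yc, w, h, cls in boxes:
        # Grid cell that contains the box center
        col = min(int(x * GRID_S), GRID_S - 1)
        row = min(int(yc * GRID_S), GRID_S - 1)
        # Anchor whose shape best matches the labeled box
        ious = [shape_iou((w, h), a) for a in ANCHORS]
        k = int(np.argmax(ious))
        # [p_c, b_x, b_y, b_w, b_h, one-hot class]
        y[row, col, k, 0] = 1.0
        y[row, col, k, 1:5] = [x, yc, w, h]
        y[row, col, k, 5 + cls] = 1.0
    return y

# Example: one box centered at (0.5, 0.6) with class 1
label = encode_label([(0.5, 0.6, 0.3, 0.25, 1)])
print(label.shape)  # (19, 19, 3, 8)
```

So the human effort is in drawing the boxes and assigning the class; turning those boxes into the full y tensor is mechanical and can be scripted once.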