Hello
I have a question regarding the third lab, “Image Classification and Object Localization”. What if an image can contain one or more objects? The issue I ran into was when building the class matrices: say I have two categories, so my class matrix would be [[0], [0, 1], [1, 1], …], with 0 for cats and 1 for dogs. But the rows are of unequal length, because one image might contain one cat while another contains two or four, etc., since the number of objects per image is not fixed. The same goes for the bounding boxes.
So, how should I organize the labels and the bounding box matrices to feed to the model?
It is important to recognize that the shape of the labels y you provide during training must match the shape of the output \hat{y} the model produces. If your model can only output a single object prediction, there is no point in feeding it labels for multiple objects.
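One common way to make the shapes match when the object count varies is to pad every image's labels to a fixed maximum and carry a mask so the loss can ignore the empty slots. A minimal sketch in NumPy, with an assumed MAX_OBJECTS cap and box format that are not part of the lab:

```python
import numpy as np

# Pad variable-length per-image labels to a fixed MAX_OBJECTS so every y
# has the same shape as the model's output head. MAX_OBJECTS is an assumed
# upper bound on objects per image.
MAX_OBJECTS = 5
NUM_COORDS = 4           # (xmin, ymin, xmax, ymax), normalized to [0, 1]

def pad_labels(class_list, box_list):
    """class_list: e.g. [0, 1]; box_list: list of 4-element boxes."""
    n = len(class_list)
    classes = np.full(MAX_OBJECTS, -1, dtype=np.int32)      # -1 marks "no object"
    boxes = np.zeros((MAX_OBJECTS, NUM_COORDS), dtype=np.float32)
    mask = np.zeros(MAX_OBJECTS, dtype=np.float32)          # 1 where a real object exists
    classes[:n] = class_list
    boxes[:n] = box_list
    mask[:n] = 1.0
    return classes, boxes, mask

# Usage: an image with one cat (class 0) and one dog (class 1)
classes, boxes, mask = pad_labels([0, 1], [[0.1, 0.2, 0.4, 0.5],
                                           [0.5, 0.1, 0.9, 0.7]])
# The mask lets the loss ignore the padded (empty) slots.
```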
Multi-object prediction requires either a single-object model run multiple times on different subregions of the input image, or a model that outputs multiple objects from a single pass over the entire image. YOLO, which gets its name from only looking at the input image once (“You Only Look Once”), uses the latter approach. As with all things in engineering, there is no free lunch: each approach has benefits and costs, and which one you pick is driven by the business outcome, i.e. whether you optimize for throughput, accuracy, memory footprint, training complexity, etc.
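If you go the YOLO-style route, the usual trick is to encode however many boxes an image has into a fixed-size grid tensor, so y always has the same shape regardless of the object count. A rough sketch, with an assumed grid size and class count that are illustrative rather than taken from the lab:

```python
import numpy as np

# YOLO-style target: a fixed S x S grid where each cell holds objectness,
# a box, and class scores. Grid size and class count here are assumptions.
S, C = 7, 2   # 7x7 grid, 2 classes (cat=0, dog=1)

def encode_yolo_targets(boxes, classes):
    """boxes: (N, 4) normalized (xmin, ymin, xmax, ymax); classes: (N,) ints."""
    target = np.zeros((S, S, 5 + C), dtype=np.float32)
    for (xmin, ymin, xmax, ymax), cls in zip(boxes, classes):
        cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2        # box center
        col = min(int(cx * S), S - 1)                        # which grid cell
        row = min(int(cy * S), S - 1)
        target[row, col, 0] = 1.0                            # objectness
        target[row, col, 1:5] = [cx * S - col, cy * S - row,
                                 xmax - xmin, ymax - ymin]   # offset + size
        target[row, col, 5 + cls] = 1.0                      # one-hot class
    return target   # fixed shape regardless of how many objects the image has
```

Either way, the key point is that the target tensor has a fixed shape: the variable-length list of objects gets encoded into it rather than fed to the model directly.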