I just want to ask: what is the key difference between YOLO and the sliding windows with convolution algorithm?
With a sliding window approach, you would run each window through a ConvNet that only tells you what object it detects in this window.
In YOLO, you run a ConvNet on the whole image. You go through a large part of the whole net before even starting to “divide the image”. Then, at different places in the net (representing different object sizes), you run the final object classification and bounding box regression (so you don’t only predict which object is in the anchor box, but also four numbers that represent an offset for a bounding box within that anchor box). These predictions are made on a very abstract representation of the image that has already been through a lot of convolutions, so intuitively speaking, at this point the net already has a rough idea of what object to expect where.
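To make the "four numbers" concrete, here is a minimal sketch of one common way to turn them into a box. It assumes the parameterization described in the YOLOv2 and YOLOv3 papers (sigmoid offsets inside a grid cell, exponential scaling of the anchor size); the function and argument names are made up for illustration, and other detectors use different encodings.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t_x, t_y, t_w, t_h, cell_x, cell_y, anchor_w, anchor_h):
    """Turn four raw network outputs into a box, measured in grid-cell units.

    Assumed parameterization (YOLOv2/v3 style):
      center = top-left corner of the grid cell + sigmoid(offset)
      size   = anchor size * exp(scale)
    """
    b_x = cell_x + sigmoid(t_x)      # center x stays inside the cell
    b_y = cell_y + sigmoid(t_y)      # center y stays inside the cell
    b_w = anchor_w * math.exp(t_w)   # width is a rescaled anchor width
    b_h = anchor_h * math.exp(t_h)   # height is a rescaled anchor height
    return b_x, b_y, b_w, b_h

# All-zero outputs give the anchor itself, centered in cell (3, 5).
box = decode_box(0.0, 0.0, 0.0, 0.0, cell_x=3, cell_y=5, anchor_w=2.0, anchor_h=4.0)
print(box)  # (3.5, 5.5, 2.0, 4.0)
```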
Instead of running a deep net on a number of windows at different places and different sizes you run a deep net once on the whole image, and while doing that you run tiny sub-nets on your anchor boxes to do the final classification and regression.
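A rough way to see the cost difference is to count forward passes. The sketch below assumes the naive approach where every window is pushed through the net separately; the image size, window sizes, and stride are arbitrary illustrative values, not taken from any paper or from the course exercise.

```python
def count_windows(image_w, image_h, window, stride):
    """Number of square windows of side `window` that fit at the given stride."""
    if window > image_w or window > image_h:
        return 0
    across = (image_w - window) // stride + 1
    down = (image_h - window) // stride + 1
    return across * down

# Hypothetical settings, chosen only for illustration.
image_w, image_h = 416, 416
window_sizes = [64, 128, 256]
stride = 32

# Naive sliding windows: one forward pass per window, per window size.
sliding_passes = sum(count_windows(image_w, image_h, w, stride) for w in window_sizes)

# Single-shot detector: one forward pass over the whole image.
single_shot_passes = 1

print(sliding_passes, single_shot_passes)  # 280 1
```

Each sliding-window pass sees a smaller input than the full image, so this is not a like-for-like FLOP comparison, but it shows why the number of passes grows quickly with more positions and scales.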
Thanks a lot @jenitta.jebaraj ! I guess the bounding box accuracy is the main reason!
Near real-time frame throughput is the main reason. There are ways to get higher bounding box accuracy with sliding windows or other algorithms if you can afford the processing cost and time. What made YOLO so interesting, especially when it first came out, was substantially improved speed without completely compromising accuracy.
One needs to be cautious when describing how “YOLO” algorithms work, as there have been many versions and implementations over the years since it was originally introduced in 2016. The version on which the exercise in this class is based is v2. That version does not make predictions at different places in the net (representing objects of different sizes) as described above. It also doesn’t run tiny sub-nets on anchor boxes. That description sounds like v3 to me, which helped YOLO deal better than v2 did with objects of different scales, especially small ones.
EDIT: adding some quotes from the v3 paper regarding accuracy and speed compared to then-state-of-the-art options
In terms of COCOs weird average mean AP metric it is on par with the SSD variants but is 3× faster. It is still quite a bit behind other models like RetinaNet in this metric though.
Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.
Finally, to the best of my knowledge, v4 is considered the current state of the original bloodline of YOLO (there are usurpers…the war of succession is complicated). If you really want to understand how “YOLO” works, maybe look at…
YOLOv4: Optimal Speed and Accuracy of Object Detection
@ai_curious, now that’s what I call an explanation, and thanks for the papers attached… I was a bit confused about why we can’t use a sliding windows algorithm with better bounding box accuracy, and the reply given to me earlier was a CP from Reddit saying the main reason is that it’s computationally costly… I really appreciate you taking the time to explain clearly!
Of course you can do that; like all engineering (should be), it is predicated on the business problem at hand. If accuracy is paramount and frame rate isn’t as important (I don’t know, maybe medical diagnosis), then go with something else. If you are trying to keep your autonomous vehicle more or less near the center of its driving lane, you can afford a few pixels of bounding box inaccuracy in exchange for the high throughput. It’s really an optimization question. YOLO’s basic design choice was to go to a single forward pass on the entire input image at prediction time, which meant accepting slightly degraded accuracy metrics to gain a substantial speed improvement. The confusion matrix ratios and the cost of mistakes dictate whether the same choice will work for you and your customers/users. Hope this helps.
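As a minimal sketch of what "confusion matrix ratios and the cost of mistakes" could look like in practice: the counts and per-mistake costs below are placeholders I made up, not measurements of any real detector, and the helper names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def expected_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of mistakes, given a cost per false positive and per false negative."""
    return fp * cost_fp + fn * cost_fn

# Placeholder counts for two hypothetical detectors (NOT real benchmark results).
candidates = {
    "fast_detector": {"tp": 90, "fp": 10, "fn": 10},
    "slow_detector": {"tp": 97, "fp": 3, "fn": 3},
}

# Placeholder costs: here a missed object is assumed to hurt 5x more than a false alarm.
cost_fp, cost_fn = 1.0, 5.0

for name, c in candidates.items():
    p, r = precision_recall(c["tp"], c["fp"], c["fn"])
    cost = expected_cost(c["fp"], c["fn"], cost_fp, cost_fn)
    print(f"{name}: precision={p:.2f} recall={r:.2f} cost={cost:.1f}")
```

You would then weigh that cost against whether each candidate meets your throughput requirement; the numbers only matter once they come from your own data and your own cost of errors.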
Yup, it helps a lot… It’s really the problem at hand that mostly dictates the algorithm (optimizing technique). I can’t thank you enough. Also, are there any books that you would recommend on optimization algorithms? I would love to get your recommendations!
I don’t know that books can keep up with the diversity and pace of research. I’d look to conference proceedings and communities of interest related to computer vision.
I don’t think it is an optimization algorithm per se. You just have to work through the functional and non-functional requirements: still images or streaming/video? What throughput is required? What thresholds must the confusion matrix achieve? What training data is available? What is the operational platform? And so on. Then see which solution offers the best overall value.