Are the convolutional implementation of sliding windows and YOLO different methods? Please correct me if I'm wrong on this.
What I think is that sliding windows should be better at predicting accurate bounding boxes, because in YOLO an object might land in a position where it is shared across multiple grid cells. That seems less problematic with sliding windows, since at some window position the object should fit.
And as mentioned in the lecture video on YOLO, suppose we design the ConvNet so that a 100x100 image yields an output of 3x3x8, considering a 3x3 grid over the input image. But if at test time we use an image of a different shape, like 124x124, how can the same ConvNet yield an output of 3x3x8? It can't, right? The dimensions will be different.
So even for YOLO to work, I think we need to train the ConvNet with a set of images that have cars not only in the fourth or sixth grid cell; there should be images with cars in every grid cell. If not, and a test image has a car in the first grid cell, can the model still classify it? Please help me with these queries, and thanks in advance.
YOLO and sliding windows are indeed different approaches. YOLO processes the entire image in one forward pass, using a grid system and anchor boxes, which makes it faster and more efficient than the sliding window method.
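To make the one-forward-pass idea concrete, here is a minimal sketch of a fully convolutional network that maps a 100x100x3 image to the 3x3x8 output volume from the lecture. The layer sizes are hypothetical, chosen only to make the arithmetic work out; this is not the actual YOLO backbone:

```python
import tensorflow as tf

# Every layer is a convolution or pooling, so the whole image is processed
# in a single pass and the output is a spatial grid of predictions.
# The 8 channels per cell stand in for [p_c, b_x, b_y, b_h, b_w, c1, c2, c3].
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 100, 3)),
    tf.keras.layers.Conv2D(16, 5, strides=2, activation="relu"),  # -> 48x48x16
    tf.keras.layers.MaxPooling2D(2),                              # -> 24x24x16
    tf.keras.layers.Conv2D(32, 5, strides=2, activation="relu"),  # -> 10x10x32
    tf.keras.layers.MaxPooling2D(2),                              # ->  5x5x32
    tf.keras.layers.Conv2D(8, 3),                                 # ->  3x3x8
])
model.summary()
```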
For most real-world applications, YOLO strikes a good balance between speed and accuracy.
Yes, the test image is resized to the fixed input size the network was trained on. Resizing is a standard preprocessing step that makes YOLO robust to varying input sizes.
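For example, a minimal preprocessing sketch (assuming TensorFlow; the 100x100 target size matches the lecture example):

```python
import tensorflow as tf

def preprocess(image):
    """Resize any test image to the network's fixed training resolution.

    A 124x124 (or any other size) image is resized to 100x100 before
    inference, so the ConvNet always sees the input shape it was trained
    on and always emits a 3x3x8 output volume.
    """
    image = tf.image.resize(image, (100, 100))
    return image / 255.0  # scale pixels to [0, 1], matching training
```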
To avoid this, I believe it’s crucial to have a diverse training dataset with objects distributed across all grid cells. This ensures that the model learns to detect objects irrespective of their positions in the image.
To give a little more detail on question 2), in addition to Kader's excellent response, note that the training of YOLO to recognize objects is not as focused on the grid cells as you might expect. The grid cells are primarily used as a convenient way to organize the presentation of the results. There is no requirement that an object be contained completely within a grid cell; instead, the object is assigned to the grid cell that contains the centroid of the object. That also makes the Non-Max Suppression (NMS) post-processing more efficient, since it's unlikely that two objects presented in the output are really the same object if their centroids are in different grid cells.
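Here is a rough sketch of that centroid-to-grid-cell bookkeeping (plain Python; the function name and box format are my own, not from the YOLO code):

```python
def assign_grid_cell(box, image_w, image_h, grid_size=3):
    """Assign a ground-truth box to the grid cell containing its centroid.

    `box` is (x_min, y_min, x_max, y_max) in pixels. The box may span
    several cells; only the cell holding the centroid is "responsible"
    for predicting the object.
    """
    cx = (box[0] + box[2]) / 2.0  # centroid x
    cy = (box[1] + box[3]) / 2.0  # centroid y
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col

# A car spanning two columns of a 100x100 image still lands in one cell:
print(assign_grid_cell((40, 10, 90, 60), 100, 100))  # -> (1, 1)
```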
YOLO is by far the most sophisticated algorithm we have seen so far in DLS. There are a number of threads on the forum that explore various aspects of YOLO in quite a bit more detail than is covered in the lectures. For example, here’s one that talks about how grid cells and anchor boxes are used in YOLO. And here’s one that talks about the Non-Max Suppression that I referred to earlier.
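Since NMS comes up a lot in those threads, here is a bare-bones sketch of the idea (a greedy IoU-based version in plain Python, not the course's implementation):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring box and drop overlapping rivals."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep  # indices of the surviving boxes
```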
You’re on the right track here, but think even more expansively. Here’s a direct quote from the 2016 YOLO v2 paper …
During training we use standard data augmentation tricks including random crops, rotations, and hue, saturation, and exposure shifts.
Detector locations (which is what the Redmon et al. team called grid location + anchor box tuples) that have never been trained with positive examples will only know how to predict negative examples. So spatial augmentation is an important element of successful training on a YOLO dataset; a rough sketch of what that might look like is below. HTH
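Something along these lines (a sketch assuming TensorFlow; the crop size and shift ranges are illustrative, not the paper's values):

```python
import tensorflow as tf

def augment(image):
    """Spatial and color augmentation in the spirit of the YOLO v2 quote.

    Random crops move object centroids into different grid cells, so
    detector locations that would otherwise never see a positive example
    get trained on one.
    """
    image = tf.image.random_crop(image, size=(90, 90, 3))  # random spatial shift
    image = tf.image.resize(image, (100, 100))             # back to network input size
    image = tf.image.random_hue(image, 0.1)                # hue shift
    image = tf.image.random_saturation(image, 0.8, 1.2)    # saturation shift
    image = tf.image.random_brightness(image, 0.1)         # exposure shift
    return image
```

One caveat: for detection data the bounding-box labels have to be transformed along with the image, and that relabeling is exactly what reassigns objects to new grid cells during training.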
@Krishna39 at one point several years ago I went down the rabbit hole of trying to train a YOLO model from scratch. Sparse grid cell coverage in the default (i.e., no spatial augmentation) dataset was one of the issues I ran into. Here's a link to some observations I made at the time about training YOLO…
Thanks for clarifying my queries. I'm so happy that the Q&A forum is active and fosters an interactive environment for learners. @ai_curious, thanks for sharing the thread; I will have a look.