Are the convolutional implementation of sliding windows and YOLO different methods? Correct me if I'm wrong on this.
What I think is that sliding windows is better at making accurate boundary predictions because, in YOLO, the object might be in a position where it is shared across multiple grid cells, which might be less of a problem with sliding windows since at some window position the object might fit.
And as mentioned in the lecture video on YOLO, suppose we design the ConvNet in such a way that a 100x100 image yields an output of 3x3x8 if we consider a 3x3 grid over the input image. But if at test time we use an image of a different shape, like 124x124, how can the same ConvNet yield an output of 3x3x8? It can't, right? The dimensions will be different.
So even if we want YOLO to work, I think we need to train the ConvNet with a set of images that have cars not only in the fourth or sixth grid cell; there should be images with cars in every grid cell. If not, and a test image has a car in the first grid cell, can it still classify it? Please help me with these queries, and thanks in advance.
YOLO and sliding window are different approaches. YOLO processes the entire image in one forward pass, using a grid system and anchor boxes, making it faster and more efficient than the sliding window method.
For most real-world applications, YOLO strikes a good balance between speed and accuracy.
Yes. Resizing is a standard preprocessing step to make YOLO robust to varying input sizes.
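For example, here is a minimal preprocessing sketch (assuming TensorFlow and a hypothetical `yolo_model` trained on 100x100 inputs; the names and sizes are just for illustration):

```python
import tensorflow as tf

INPUT_SIZE = (100, 100)  # the fixed size the network was trained on (example value)

def preprocess(image):
    """Resize an arbitrarily sized image to the network's fixed input size
    and scale pixel values to [0, 1]."""
    image = tf.image.resize(image, INPUT_SIZE)   # e.g. 124x124x3 -> 100x100x3
    image = image / 255.0
    return tf.expand_dims(image, axis=0)         # add a batch dimension

# raw = tf.io.decode_jpeg(tf.io.read_file("test.jpg"))  # any H x W x 3 image
# predictions = yolo_model(preprocess(raw))             # output shape stays 1 x 3 x 3 x 8
```

Because the network only ever sees the fixed input size, the 3x3x8 output shape never changes.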
To avoid the problem you describe in your last question, I believe it's crucial to have a diverse training dataset with objects distributed across all grid cells. This ensures that the model learns to detect objects irrespective of their positions in the image.
To give a little more detail on question 2) in addition to Kader's excellent response, note that the training of YOLO to recognize objects is not as focussed on the grid cells as you might expect. The grid cells are primarily used as a convenient way to organize the presentation of the results. There is no requirement that an object be contained completely within a grid cell, but the object is assigned to the grid cell that contains the centroid of the object. That also makes the NMS post processing more efficient, since it's unlikely that two objects presented in the output are really the same object if their centroids are in different grid cells.
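To illustrate that centroid assignment, here is a rough sketch of how a training label might be built (a minimal numpy example under my own assumptions: a 3x3 grid, box coordinates normalized to the whole image, and 3 classes; this is not the actual course or paper code):

```python
import numpy as np

S = 3  # grid size, matching the 3x3 example from the lecture

def encode_label(boxes, grid_size=S, num_classes=3):
    """Build a (S, S, 5 + num_classes) training target.
    Each box is (x_center, y_center, width, height, class_id), with
    coordinates normalized to [0, 1] relative to the whole image."""
    label = np.zeros((grid_size, grid_size, 5 + num_classes), dtype=np.float32)
    for x, y, w, h, cls in boxes:
        col = min(int(x * grid_size), grid_size - 1)  # grid cell containing the centroid
        row = min(int(y * grid_size), grid_size - 1)
        label[row, col, 0] = 1.0                      # p_c: this cell owns the object
        label[row, col, 1:5] = [x, y, w, h]           # b_x, b_y, b_w, b_h
        label[row, col, 5 + int(cls)] = 1.0           # one-hot class
    return label

# A car (class 1) whose centroid falls in the middle cell of the 3x3 grid:
# y = encode_label([(0.5, 0.55, 0.3, 0.2, 1)])  ->  y[1, 1, 0] == 1.0
```

Note that the object can spill over into neighboring cells; only its centroid determines which cell is responsible for it.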
YOLO is by far the most sophisticated algorithm we have seen so far in DLS. There are a number of threads on the forum that explore various aspects of YOLO in quite a bit more detail than is covered in the lectures. For example, here's one that talks about how grid cells and anchor boxes are used in YOLO. And here's one that talks about the Non-Max Suppression that I referred to earlier.
You're on the right track here, but think even more expansively. Here's a direct quote from the 2016 YOLO v2 paper …
During training we use standard data augmentation tricks including random crops, rotations, and hue, saturation, and exposure shifts.
Detector locations (which is what the Redmon et al. team called grid location + anchor box tuples) that have never been trained with positive examples will only know how to predict negative examples. So spatial augmentation is an important element of successful training on a YOLO dataset. HTH
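As a rough illustration of that kind of augmentation, here is a sketch using tf.image (my own example; the parameter values and choice of transforms are assumptions, not the ones from the paper, and I've left out rotations):

```python
import tensorflow as tf

def augment(image, crop_size=(100, 100)):
    """Random crop plus hue / saturation / brightness jitter, so that objects
    end up centered in different grid cells across training epochs."""
    image = tf.image.random_crop(image, size=(*crop_size, 3))  # spatial shift
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_hue(image, max_delta=0.1)
    image = tf.image.random_saturation(image, lower=0.7, upper=1.3)
    image = tf.image.random_brightness(image, max_delta=0.2)   # rough stand-in for "exposure"
    return image
```

Keep in mind that the spatial transforms (crop, flip) also require adjusting the bounding box labels to match, which is the fiddly part of a real YOLO training pipeline.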
@Krishna39 at one point several years ago I went down the rabbit hole of trying to train a YOLO model from scratch. Sparse grid cell coverage in the default (ie no spatial augmentation) data set was one of the issues I ran in to. Here's a link to some observations I made at the time about training YOLO…
Thanks for clarifying my queries; I'm so happy that the Q&A forum is active and fostering an interactive environment for learners. Thanks @ai_curious for sharing the threads, I will have a look.
Can I say the lecturer made a mistake in this video
when he said
You're going to place down a grid on this image. And for the purposes of illustration, I'm going to use a 3 by 3 grid. Although in an actual implementation, you'd use a finer one, like maybe a 19 by 19 grid. And basic idea is you're going to take the image classification and localization algorithm that you saw in the first video of this week, and apply that to each of the nine grid cells of this image.
since we don't actually perform the prediction by sending each of the nine cells into the YOLO model, but by sending the whole picture, per the discussion here and in other related posts? Some bounding boxes might label an object that spans multiple grid cells, and it would make no sense to predict it from the image information of a single grid cell alone.
You are right to point out that the YOLO algorithm operates on the whole image at once. The grid cells are just used as a way to organize the output results. But I'd have to go back and listen to the earlier lecture that he is referring to in order to get the full context here. I won't have time to do that today. My guess is that we're just reading too much into what he means by that statement.
Your observation is correct that with YOLO we're not talking about running forward propagation for each grid cell as a separate input. Indeed, later in the video Prof Ng says explicitly "this is a convolutional implementation, right? You're not implementing this algorithm nine times, on the 3 by 3 grid, or if you're using a 19 by 19 grid, 19 squared is 361. So you're not running the same algorithm, you know, 361 times or 19 squared times." I think the reference to the week 1 lecture is to suggest that the output vector for the single object localization algorithm and the YOLO output vector for each grid cell, the 8 values p_c, b_x, b_y, b_w, b_h, c_0, c_1, c_2, are conceptually the same.
Where the week 1 algorithm outputs a vector for a single object per input image, this explanation of YOLO can do it for one object for each grid cell*, with faster throughput than sliding windows and better localization accuracy, too.
*actually, YOLO can detect more than one object per grid cell, but the mechanism that enables that hasn't been introduced by this point in the lectures. See anchor boxes in later videos or extensive discussion elsewhere in this forum. HTH
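If it helps to see the shapes concretely, here is a toy sketch of reading one detection per confident grid cell out of a 3x3x8 output (my own illustration; the 0.5 threshold and the [p_c, b_x, b_y, b_w, b_h, c_0, c_1, c_2] slot layout are assumptions for the example):

```python
import numpy as np

def decode_predictions(output, threshold=0.5):
    """output: (3, 3, 8) array holding [p_c, b_x, b_y, b_w, b_h, c_0, c_1, c_2]
    for each grid cell. Returns one candidate detection per confident cell."""
    detections = []
    grid_size = output.shape[0]
    for row in range(grid_size):
        for col in range(grid_size):
            p_c = output[row, col, 0]
            if p_c < threshold:
                continue                                  # this detector saw no object centroid
            bx, by, bw, bh = output[row, col, 1:5]
            class_id = int(np.argmax(output[row, col, 5:]))
            detections.append((p_c, bx, by, bw, bh, class_id))
    return detections                                     # candidates then go through non-max suppression
```

Each per-cell 8-vector is read exactly like the week 1 localization output, just repeated once per grid cell.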
I think this is just how the convolutional implementation runs faster than sliding windows: it operates on the whole image across the grid cells, and that advantage would be lost if we ran it separately for each cell, since there is plenty of shared computation as the filters slide across the borders of the cells.
BTW, is it true that the grid cell concept is introduced here mainly for two reasons?
1) It helps organize the output tidily when there are many detectable objects (along with the anchor box concept). Thus the labeler can easily judge which cell is responsible by looking at the center point of the object.
2) Since each cell's output is linked to a fixed location in the model output, once that link is picked up by the network during training, the whole training process becomes more efficient, because the network is discovering a "rule" that was defined by a human.
So the grid cell concept affects how the algorithm operates on the image implicitly rather than explicitly?
This is how I think about it. Back in the day (2012?) it was a big deal to demonstrate that a CNN could condense the information from an image into a single value - image classification. That was followed shortly by the ability to produce 4 values - object location, then 5 values - object detection. The problem was that the networks could only handle a single object per image. During the 2014 time period, lots of work was going on to make object detection practical and useful by 1) speeding it up, 2) making it more accurate, and 3) making it work with more than a single object. The best algorithms of the day achieved one, or two of those objectives, but not all 3. The paradigm shift of YOLO circa 2015/2016 was that it could do all 3. It worked on multiple objects, was very fast, and was acceptably accurate.
The grid cells in effect define "detectors" trained to make predictions about objects centered in their region of responsibility and to ignore objects in other regions of the image. In my mind, this isn't an optional or merely "handy" feature, but rather a quite fundamental and explicit part of the YOLO idea. And I probably wouldn't call it a rule, since it is communicated to the learning algorithm/cost function in the same manner as any supervised learning task would do; using 0 and 1 in the p_c slot. The rule, if you want to call it one, would be "If there is a 1 in the training data, there is an object you are responsible to detect, otherwise there is not."
By the way, in my own experience training a YOLO network, the ability to correctly predict p_c is an underrated capability. If that is wrong, it won't matter how good the localization and classification are because the system will be compromised by the high rate of false positive and false negative results. The lectures don't emphasize this part of the prediction output, but it is crucial to get it right.
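To make that concrete, here is a toy sketch of the objectness part of the loss, just binary cross-entropy on the p_c slot over all grid cells (a simplification of my own; the actual YOLO papers weight the object and no-object cells differently and add coordinate and class terms):

```python
import numpy as np

def objectness_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy on the p_c slot, averaged over all grid cells.
    y_true, y_pred: (S, S, 8) arrays where slot 0 is p_c."""
    p_true = y_true[..., 0]
    p_pred = np.clip(y_pred[..., 0], eps, 1.0 - eps)
    bce = -(p_true * np.log(p_pred) + (1.0 - p_true) * np.log(1.0 - p_pred))
    return bce.mean()
```

If this term is poorly trained, the false positives and false negatives swamp whatever the box and class predictions get right.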