Are the convolutional implementation of sliding windows and YOLO different methods? Correct me if I'm wrong on this.
What I think is that sliding windows is better at making accurate boundary predictions because, in YOLO, the object might be in a position where it is shared across multiple grid cells, which might be less of a problem with sliding windows since at some window position the object might fit.
And as mentioned in the lecture video on YOLO, suppose we design the ConvNet in such a way that a 100x100 image yields an output of 3x3x8 if we consider a 3x3 grid over the input image. But if at test time we use an image of a different shape, like 124x124, how can the same ConvNet yield an output of 3x3x8? It can't, right? The dimensions will be different.
So even if we want YOLO to work, I think we need to train the ConvNet with a set of images that have cars not only in the fourth or sixth grid cell; there should be images with cars in every grid cell. If not, and a test image has a car in the first grid cell, can it still classify it? Please help me with these queries, and thanks in advance.
YOLO and sliding window are different approaches. YOLO processes the entire image in one forward pass, using a grid system and anchor boxes, making it faster and more efficient than the sliding window method.
For most real-world applications, YOLO strikes a good balance between speed and accuracy.
Yes. Resizing is a standard preprocessing step to make YOLO robust to varying input sizes.
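For example, here is a minimal preprocessing sketch (assuming TensorFlow and a hypothetical `yolo_model` trained on 100x100 inputs; the names and sizes are just for illustration):

```python
import tensorflow as tf

INPUT_SIZE = (100, 100)  # the fixed size the network was trained on (example value)

def preprocess(image):
    """Resize an arbitrarily sized image to the network's fixed input size
    and scale pixel values to [0, 1]."""
    image = tf.image.resize(image, INPUT_SIZE)   # e.g. 124x124x3 -> 100x100x3
    image = image / 255.0
    return tf.expand_dims(image, axis=0)         # add a batch dimension

# raw = tf.io.decode_jpeg(tf.io.read_file("test.jpg"))  # any H x W x 3 image
# predictions = yolo_model(preprocess(raw))             # output shape stays 1 x 3 x 3 x 8
```

Because the network only ever sees the fixed input size, the 3x3x8 output shape never changes.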
To avoid the problem you describe in your last question, I believe it's crucial to have a diverse training dataset with objects distributed across all grid cells. This ensures that the model learns to detect objects irrespective of their positions in the image.
To give a little more detail on question 2) in addition to Kader's excellent response, note that the training of YOLO to recognize objects is not as focussed on the grid cells as you might expect. The grid cells are primarily used as a convenient way to organize the presentation of the results. There is no requirement that an object be contained completely within a grid cell, but the object is assigned to the grid cell that contains the centroid of the object. That also makes the NMS post processing more efficient, since it's unlikely that two objects presented in the output are really the same object if their centroids are in different grid cells.
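To illustrate that centroid assignment, here is a rough sketch of how a training label might be built (a minimal numpy example under my own assumptions: a 3x3 grid, box coordinates normalized to the whole image, and 3 classes; this is not the actual course or paper code):

```python
import numpy as np

S = 3  # grid size, matching the 3x3 example from the lecture

def encode_label(boxes, grid_size=S, num_classes=3):
    """Build a (S, S, 5 + num_classes) training target.
    Each box is (x_center, y_center, width, height, class_id), with
    coordinates normalized to [0, 1] relative to the whole image."""
    label = np.zeros((grid_size, grid_size, 5 + num_classes), dtype=np.float32)
    for x, y, w, h, cls in boxes:
        col = min(int(x * grid_size), grid_size - 1)  # grid cell containing the centroid
        row = min(int(y * grid_size), grid_size - 1)
        label[row, col, 0] = 1.0                      # p_c: this cell owns the object
        label[row, col, 1:5] = [x, y, w, h]           # b_x, b_y, b_w, b_h
        label[row, col, 5 + int(cls)] = 1.0           # one-hot class
    return label

# A car (class 1) whose centroid falls in the middle cell of the 3x3 grid:
# y = encode_label([(0.5, 0.55, 0.3, 0.2, 1)])  ->  y[1, 1, 0] == 1.0
```

Note that the object can spill over into neighboring cells; only its centroid determines which cell is responsible for it.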
YOLO is by far the most sophisticated algorithm we have seen so far in DLS. There are a number of threads on the forum that explore various aspects of YOLO in quite a bit more detail than is covered in the lectures. For example, here's one that talks about how grid cells and anchor boxes are used in YOLO. And here's one that talks about the Non-Max Suppression that I referred to earlier.
You're on the right track here, but think even more expansively. Here's a direct quote from the 2016 YOLO v2 paper …
During training we use standard data augmentation tricks including random crops, rotations, and hue, saturation, and exposure shifts.
Detector locations (which is what the Redmon et al. team called grid location + anchor box tuples) that have never been trained with positive examples will only know how to predict negative examples. So spatial augmentation is an important element of successful training on a YOLO dataset. HTH
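As a rough illustration of that kind of augmentation, here is a sketch using tf.image (my own example; the parameter values and choice of transforms are assumptions, not the ones from the paper, and I've left out rotations):

```python
import tensorflow as tf

def augment(image, crop_size=(100, 100)):
    """Random crop plus hue / saturation / brightness jitter, so that objects
    end up centered in different grid cells across training epochs."""
    image = tf.image.random_crop(image, size=(*crop_size, 3))  # spatial shift
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_hue(image, max_delta=0.1)
    image = tf.image.random_saturation(image, lower=0.7, upper=1.3)
    image = tf.image.random_brightness(image, max_delta=0.2)   # rough stand-in for "exposure"
    return image
```

Keep in mind that the spatial transforms (crop, flip) also require adjusting the bounding box labels to match, which is the fiddly part of a real YOLO training pipeline.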
@Krishna39 at one point several years ago I went down the rabbit hole of trying to train a YOLO model from scratch. Sparse grid cell coverage in the default (ie no spatial augmentation) data set was one of the issues I ran in to. Here's a link to some observations I made at the time about training YOLO…
Thanks for clarifying my queries; I'm so happy that the Q&A forum is active and fostering an interactive environment for learners. Thanks @ai_curious for sharing the threads, I will have a look.
Can I say the lecturer made a mistake in this video
when he said
You're going to place down a grid on this image. And for the purposes of illustration, I'm going to use a 3 by 3 grid. Although in an actual implementation, you'd use a finer one, like maybe a 19 by 19 grid. And basic idea is you're going to take the image classification and localization algorithm that you saw in the first video of this week, and apply that to each of the nine grid cells of this image.
since we don't actually perform the prediction by sending each of the nine cells into the YOLO model, but by sending the whole picture, per the discussion here and in other related posts? Some bounding boxes might label an object that spans multiple grid cells, and it would make no sense to predict it from the image information of a single grid cell alone.
You are right to point out that the YOLO algorithm operates on the whole image at once. The grid cells are just used as a way to organize the output results. But I'd have to go back and listen to the earlier lecture that he is referring to in order to get the full context here. I won't have time to do that today. My guess is that we're just reading too much into what he means by that statement.
Your observation is correct that with YOLO we're not talking about running forward propagation for each grid cell as a separate input. Indeed, later in the video Prof Ng says explicitly "this is a convolutional implementation, right? You're not implementing this algorithm nine times, on the 3 by 3 grid, or if you're using a 19 by 19 grid, 19 squared is 361. So you're not running the same algorithm, you know, 361 times or 19 squared times." I think the reference to the week 1 lecture is to suggest that the output vector for the single object localization algorithm and the YOLO output vector for each grid cell, the 8 values p_c, b_x, b_y, b_w, b_h, c_0, c_1, c_2, are conceptually the same.
Where the week 1 algorithm outputs a vector for a single object per input image, this explanation of YOLO can do it for one object for each grid cell*, with faster throughput than sliding windows and better localization accuracy, too.
*actually, YOLO can detect more than one object per grid cell, but the mechanism that enables that hasn't been introduced by this point in the lectures. See anchor boxes in later videos or extensive discussion elsewhere in this forum. HTH
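If it helps to see the shapes concretely, here is a toy sketch of reading one detection per confident grid cell out of a 3x3x8 output (my own illustration; the 0.5 threshold and the [p_c, b_x, b_y, b_w, b_h, c_0, c_1, c_2] slot layout are assumptions for the example):

```python
import numpy as np

def decode_predictions(output, threshold=0.5):
    """output: (3, 3, 8) array holding [p_c, b_x, b_y, b_w, b_h, c_0, c_1, c_2]
    for each grid cell. Returns one candidate detection per confident cell."""
    detections = []
    grid_size = output.shape[0]
    for row in range(grid_size):
        for col in range(grid_size):
            p_c = output[row, col, 0]
            if p_c < threshold:
                continue                                  # this detector saw no object centroid
            bx, by, bw, bh = output[row, col, 1:5]
            class_id = int(np.argmax(output[row, col, 5:]))
            detections.append((p_c, bx, by, bw, bh, class_id))
    return detections                                     # candidates then go through non-max suppression
```

Each per-cell 8-vector is read exactly like the week 1 localization output, just repeated once per grid cell.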
I think this is just how the convolutional implementation runs faster than sliding windows: it operates on the whole image across the grid cells, and that advantage would be lost if we ran it separately for each cell, since there is plenty of shared computation as the filters slide across the borders of the cells.
BTW, is it true that the grid cell concept is introduced here mainly for two reasons?
1) It helps organize the output tidily when there are many detectable objects (along with the anchor box concept). Thus the labeler can easily judge which cell is responsible by looking at the center point of the object.
2) Since each cell's output is linked to a fixed location in the model output, once that link is picked up by the network during training, the whole training process becomes more efficient, because the network is discovering a "rule" that was defined by a human.
So the grid cell concept affects how the algorithm operates on the image implicitly rather than explicitly?
This is how I think about it. Back in the day (2012?) it was a big deal to demonstrate that a CNN could condense the information from an image into a single value - image classification. That was followed shortly by the ability to produce 4 values - object location, then 5 values - object detection. The problem was that the networks could only handle a single object per image. During the 2014 time period, lots of work was going on to make object detection practical and useful by 1) speeding it up, 2) making it more accurate, and 3) making it work with more than a single object. The best algorithms of the day achieved one, or two of those objectives, but not all 3. The paradigm shift of YOLO circa 2015/2016 was that it could do all 3. It worked on multiple objects, was very fast, and was acceptably accurate.
The grid cells in effect define "detectors" trained to make predictions about objects centered in their region of responsibility and to ignore objects in other regions of the image. In my mind, this isn't an optional or merely "handy" feature, but rather a quite fundamental and explicit part of the YOLO idea. And I probably wouldn't call it a rule, since it is communicated to the learning algorithm/cost function in the same manner as any supervised learning task would do; using 0 and 1 in the p_c slot. The rule, if you want to call it one, would be "If there is a 1 in the training data, there is an object you are responsible to detect, otherwise there is not."
By the way, in my own experience training a YOLO network, the ability to correctly predict p_c is an underrated capability. If that is wrong, it won't matter how good the localization and classification are because the system will be compromised by the high rate of false positive and false negative results. The lectures don't emphasize this part of the prediction output, but it is crucial to get it right.
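To make that concrete, here is a toy sketch of the objectness part of the loss, just binary cross-entropy on the p_c slot over all grid cells (a simplification of my own; the actual YOLO papers weight the object and no-object cells differently and add coordinate and class terms):

```python
import numpy as np

def objectness_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy on the p_c slot, averaged over all grid cells.
    y_true, y_pred: (S, S, 8) arrays where slot 0 is p_c."""
    p_true = y_true[..., 0]
    p_pred = np.clip(y_pred[..., 0], eps, 1.0 - eps)
    bce = -(p_true * np.log(p_pred) + (1.0 - p_true) * np.log(1.0 - p_pred))
    return bce.mean()
```

If this term is poorly trained, the false positives and false negatives swamp whatever the box and class predictions get right.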