From what I have seen, the CNNs we have been working on have been trained on square images. In week 3’s first assignment, however, the predictions are made on rectangular images. It’s stated that YOLO’s network was trained on 608x608 images. If we test on images of a different size – for example, the car detection dataset has 720x1280 images – the last step in the assignment rescales the boxes so that they can be plotted on top of the original 720x1280 image.
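To make my question concrete, here is a minimal sketch of what I understand that box-rescaling step to do (assuming, as in the assignment, boxes normalized to [0, 1] in (y1, x1, y2, x2) order; the function name is mine):

```python
import numpy as np

def scale_boxes_to_image(boxes, image_height, image_width):
    """Scale normalized (y1, x1, y2, x2) boxes to pixel coordinates.

    Assumes boxes are in [0, 1], as YOLO's post-processing produces
    before plotting (mirrors the assignment's box-rescaling step).
    """
    scale = np.array([image_height, image_width, image_height, image_width])
    return boxes * scale

# e.g. a box covering the left half of a 720x1280 image
box = np.array([[0.0, 0.0, 1.0, 0.5]])
print(scale_boxes_to_image(box, 720, 1280))  # pixels: [0, 0, 720, 640]
```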
How does this actually work? What is happening under the hood? The filter sizes, strides – basically everything – have been tuned to a particular input size. So, if we were to run YOLO in one of our projects, how would we go about it? What does the preprocessing do?
Many thanks and have a nice day!
One way to use the existing YOLO model would be to resize your image to the input shape that’s accepted by the model. See the preprocess_image function defined in utils.py: the image is read and resized to (608, 608, 3).
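A minimal sketch of what a preprocessing step like that does, assuming PIL for the resize (the exact resampling filter and normalization in the assignment’s utils.py may differ):

```python
import numpy as np
from PIL import Image

def preprocess_image(img_path, model_image_size=(608, 608)):
    """Resize an image to the model's input shape and normalize to [0, 1].

    A sketch of the kind of preprocessing utils.py performs, not the
    assignment's exact code.
    """
    image = Image.open(img_path)
    resized = image.resize(model_image_size, Image.BICUBIC)
    image_data = np.array(resized, dtype="float32") / 255.0
    image_data = np.expand_dims(image_data, axis=0)  # add batch dimension
    return image, image_data
```

The original image is returned alongside the network input so the predicted boxes can later be rescaled back onto it.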
Thanks a lot. I took a look at the function. From its documentation, as far as I understood, it downsamples the image and applies bicubic interpolation (it’s still not very clear to me why interpolation would be needed while downsampling; if we were enlarging the image, that would make a lot more sense).
The main thing that confuses me is that when the image was shrunk, so were all the objects in it, and the aspect ratio was not preserved. So, if the algorithm was trained on “normal” looking objects, like cars for example, how can it work properly on cars whose changed aspect ratio makes them look longer and slimmer? Are object detection algorithms invariant to changes in aspect ratio?
Thanks a lot in advance!
“Downsampling” can be done in a straightforward way or in a more sophisticated way that includes interpolation. Just picking (e.g.) every other pixel doesn’t necessarily give you the best quality results, and you also have to deal with “quantization effects” (suppose the shrinkage ratio does not evenly divide the number of pixels in a given dimension). Image processing is a whole discipline unto itself with lots of “prior art”. Google is your friend.
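To illustrate the difference, here is a toy sketch (my own example, not from any library): naive downsampling just keeps every k-th pixel, while even a simple block average acts as a crude filter/interpolation and handles detail more gracefully.

```python
import numpy as np

def naive_downsample(img, k):
    """Keep every k-th pixel. Cheap, but discards information and can
    alias fine detail (no filtering first)."""
    return img[::k, ::k]

def block_average(img, k):
    """Average k x k blocks: a simple form of filtered downsampling.
    Assumes dimensions divide evenly; real libraries also handle the
    ragged case (the "quantization effects" mentioned above)."""
    h, w = img.shape[0] // k * k, img.shape[1] // k * k
    img = img[:h, :w].astype(float)
    return img.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
print(naive_downsample(img, 2)[0, 0])  # 0.0 (just the corner pixel)
print(block_average(img, 2)[0, 0])     # 3.5 (mean of the 2x2 block)
```

Bicubic interpolation is a more sophisticated version of the same idea: each output pixel is computed from a weighted neighborhood of input pixels rather than a single sample.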
Generally speaking there is nothing about CNN algorithms that requires the inputs to be square. If you are training your algorithm from scratch, then it’s a decision you as the system designer need to make about the shape and resolution of your input images. If you’re using Transfer Learning, then you are obviously constrained by the input definition of the system you are using as your starting point. It’s a good question whether you’ll have problems if you have to resize your images from non-square to square with the resulting distortion of aspect ratios. Aspect ratios are a big deal in YOLO with the anchor boxes and all. I don’t have any experience applying YOLO, but I hope someone else listening here will be able to shed some actual light on that question.
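One common trick worth knowing about in this context is “letterbox” resizing, which some YOLO implementations use (I’m not claiming the assignment does): scale the image to fit the target while preserving aspect ratio, then pad the remainder with a neutral color. A sketch, assuming PIL; the fill color and centering are conventions, not requirements:

```python
from PIL import Image

def letterbox(image, target=(608, 608), fill=(128, 128, 128)):
    """Resize preserving aspect ratio, padding the rest with gray.

    Avoids the aspect-ratio distortion of a plain resize: objects keep
    their shape, at the cost of some "wasted" padded pixels.
    """
    tw, th = target
    w, h = image.size
    scale = min(tw / w, th / h)        # fit the longer side
    nw, nh = int(w * scale), int(h * scale)
    resized = image.resize((nw, nh), Image.BICUBIC)
    canvas = Image.new("RGB", target, fill)
    # paste centered, leaving gray bands on the short side
    canvas.paste(resized, ((tw - nw) // 2, (th - nh) // 2))
    return canvas
```

With this approach the predicted boxes must also be shifted and rescaled to undo the padding before plotting on the original image.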
Thank you very much! The article you shared seems very interesting. Just like you said, Google is my friend, but for sure IEEE isn’t. I will try to get my hands on that article by asking around.
As for the second paragraph, it’s perfectly clear, and thank you very much for sharing your valuable insights!
Oh, sorry, I just read the Abstract and didn’t click through, so wasn’t aware it requires a login. But the point is this is a topic that’s been around for a while (note that the publication date on that paper is 2011). I’m sure a few searches will turn up some “open source” material on this topic.
No worries! I did some research, though. I had one potentially good hit, but it had some problems, so I will probably contact the authors. Thanks a lot for your kind help and time!