Week #2: YOLO
Hello everyone
Wishing you a happy December and happy holidays,
I have a question about the training dataset for YOLO algorithm:
In the case of a classification problem, the dataset contains images and a y label for each image.
So, for YOLO we have two problems, classification and localization (regression), right?
For classifying objects (C1, C2, …) and predicting the bounding box for each object (Px, Py, Ph, Pw).
My first question is that I don’t understand what the dataset looks like for the YOLO algorithm.
I mean, does the training dataset contain classification/localization labels for each cell in the grid, or does it just contain the y vector for each object and its localization info? And if the latter, how can (Px, Py) be compared with the predicted Px, Py, given that all cells share the same X and Y range of values?
My second question depends on the answer to the first: if the training dataset does contain a y vector for each cell in the grid, does that mean the test set must use the same grid size as the training set?
The dataset includes a txt file for each image (though this is implementation dependent). Each line of the file represents one object in that image. The first number is the index of the object’s class. The next two numbers are the center of the rectangle surrounding the object, and the last two numbers are the width and height of that rectangle. All of these values are normalized to between 0 and 1: x and the box width are divided by the width of the whole image, and y and the box height are divided by the height of the whole image.
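As a concrete illustration, here is a minimal sketch of parsing one such label file. The function name and the example values are hypothetical; the line format (class index, normalized center x/y, normalized width/height) is as described above.

```python
# Hypothetical sketch: parsing the contents of one YOLO-format label file.
# Each line: class_index x_center y_center width height, all in [0, 1].

def parse_label_file(text):
    """Parse YOLO label text into a list of object dicts."""
    objects = []
    for line in text.strip().splitlines():
        cls, x, y, w, h = line.split()
        objects.append({
            "class": int(cls),
            "x": float(x),  # box center x / image width
            "y": float(y),  # box center y / image height
            "w": float(w),  # box width / image width
            "h": float(h),  # box height / image height
        })
    return objects

# Example file contents: two objects of (made-up) classes 1 and 0.
example = "1 0.5 0.5 0.4 0.6\n0 0.25 0.75 0.1 0.2"
print(parse_label_file(example))
```

Note that nothing in the file mentions grids or cells; it is just per-object boxes in image-relative coordinates.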
Because of the normalization, the images in the training and test sets need not be the same size. Even within a training set, images do not have to be the same size.
One point to clarify is that the raw data is generally unaware of the computer vision task and algorithm that will consume it. So, typically, your raw data is just a collection of images and associated label files, as described above. The labels contain the location and class of the objects. Typically the raw data contains no information about ‘grids’, which are algorithm-specific.
You must preprocess the raw data to prepare it for use in YOLO. That process is described in the linked (and other existing) threads. Basically, it means mapping the raw data onto a multi-dimensional structure with the same shape as the output of the YOLO neural network; this then becomes the ground-truth matrix used in the cost function. Hope this helps.
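To make that mapping concrete, here is a minimal sketch of turning the per-object labels into a grid-shaped ground-truth array. The grid size S, the number of classes, and the per-cell layout (objectness, box, one-hot class) are assumptions for illustration; real implementations vary (e.g. anchor boxes). It also shows the answer to the question about (Px, Py): the box center is converted into an offset within its cell, which is why every cell can use the same value range.

```python
import numpy as np

# Hypothetical sketch: mapping normalized object labels onto an S x S grid,
# producing a ground-truth array shaped like a (simplified) YOLO output.
# Assumed per-cell layout: [objectness, x, y, w, h, one-hot classes...].

def build_ground_truth(objects, S=7, num_classes=3):
    y_true = np.zeros((S, S, 5 + num_classes))
    for obj in objects:
        # Which cell does the box center fall into?
        col = min(int(obj["x"] * S), S - 1)
        row = min(int(obj["y"] * S), S - 1)
        # x, y become offsets *within* that cell, in [0, 1) -- so the
        # network's per-cell (Px, Py) predictions are compared against
        # cell-relative targets, not image-relative ones.
        cell_x = obj["x"] * S - col
        cell_y = obj["y"] * S - row
        y_true[row, col, 0] = 1.0                   # objectness
        y_true[row, col, 1:5] = [cell_x, cell_y, obj["w"], obj["h"]]
        y_true[row, col, 5 + obj["class"]] = 1.0    # one-hot class
    return y_true

objs = [{"class": 1, "x": 0.5, "y": 0.5, "w": 0.4, "h": 0.6}]
gt = build_ground_truth(objs)
print(gt.shape)     # (7, 7, 8)
print(gt[3, 3, 0])  # 1.0 -- the cell containing the box center is responsible
```

Since the grid is imposed during this preprocessing step (and by the network's output shape), not by the raw data, the same grid applies to any image size once coordinates are normalized.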