I want to know what the training dataset, especially the output (label) dataset, looks like.

Do we create it meticulously with the shape (m, 19, 19, 5, 85), where 19 is the number of grid cells per side, 5 is the number of anchor boxes, and 85 is the number of classes, or do we just provide (pc, bx, by, bh, bw) relative to the image and propagate it convolutionally for every grid cell?

Hi @Ayaman

The output has the shape (m, 19, 19, 5, 85), where m is the number of examples, 19 is the number of grid cells per dimension, 5 is the number of anchor boxes per grid cell, and 85 is the length of each box vector: the object confidence score pc, the four bounding box parameters (bx, by, bh, bw), and the class probabilities for 80 classes. So there are 80 classes, not 85. The labels themselves are not propagated convolutionally; each ground-truth box is written directly into the grid cell and anchor box slot it belongs to.
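To make the layout of the last axis concrete, here is a minimal NumPy sketch. It assumes the ordering (pc, bx, by, bh, bw, c1, ..., c80) along the last axis; the batch size `m = 2` and the variable names are illustrative only.

```python
import numpy as np

# m is an arbitrary illustrative batch size; S, A, C follow the shape discussed above.
m, S, A, C = 2, 19, 5, 80
Y = np.zeros((m, S, S, A, 5 + C), dtype=np.float32)

# Assumed ordering of the last axis: [pc, bx, by, bh, bw, c1, ..., c80]
pc = Y[..., 0]            # object confidence, shape (m, 19, 19, 5)
box = Y[..., 1:5]         # bx, by, bh, bw, shape (m, 19, 19, 5, 4)
class_probs = Y[..., 5:]  # class vector, shape (m, 19, 19, 5, 80)

print(Y.shape, pc.shape, box.shape, class_probs.shape)
```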

The former is correct: you must create the training labels in the exact shape of the YOLO network output. Forward propagation accepts the input X and produces the predictions $\hat{Y}$. The loss function computes the difference between the predictions $\hat{Y}$ and the training labels Y, which is why they need to have the same shape. The labelled training data Y is *not* what flows through the convolutional network; that is the image input X.

Y is created using a straightforward mapping that assigns the label values to the correct locations in the tensor; it is *not* created using YOLO or any other neural network.
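As a rough illustration of such a mapping, here is a sketch for a single image. It rests on several assumptions that are not from this thread: boxes are given as (bx, by, bh, bw, class_id) normalized to the image, each box goes to the grid cell containing its centre and to the anchor whose shape overlaps it best, and the anchor sizes are made-up values. The function name `encode_labels` is hypothetical. Real implementations often store coordinates relative to the grid cell or anchor instead.

```python
import numpy as np

def encode_labels(boxes, anchors, grid=19, num_classes=80):
    """Build one image's label tensor, shape (grid, grid, len(anchors), 5 + num_classes).

    boxes:   iterable of (bx, by, bh, bw, class_id); (bx, by) is the box centre,
             all coordinates normalized to [0, 1] relative to the image (assumption).
    anchors: list of (height, width) pairs in the same normalized units (assumption).
    """
    y = np.zeros((grid, grid, len(anchors), 5 + num_classes), dtype=np.float32)
    for bx, by, bh, bw, cls in boxes:
        # Grid cell that contains the box centre.
        col = min(int(bx * grid), grid - 1)
        row = min(int(by * grid), grid - 1)
        # Pick the anchor with the highest shape IoU (both centred at the same point).
        best, best_iou = 0, -1.0
        for k, (ah, aw) in enumerate(anchors):
            inter = min(bh, ah) * min(bw, aw)
            iou = inter / (bh * bw + ah * aw - inter)
            if iou > best_iou:
                best, best_iou = k, iou
        y[row, col, best, 0] = 1.0                   # pc
        y[row, col, best, 1:5] = (bx, by, bh, bw)    # box parameters
        y[row, col, best, 5 + int(cls)] = 1.0        # one-hot class
    return y

# Made-up anchors and a single made-up box of class 2 centred in the image.
anchors = [(0.1, 0.1), (0.2, 0.4), (0.4, 0.2), (0.6, 0.6), (0.9, 0.9)]
y = encode_labels([(0.5, 0.5, 0.2, 0.4, 2)], anchors)
Y = np.stack([y])  # add the batch axis m
print(Y.shape)     # (1, 19, 19, 5, 85)
```

Every slot that holds no object stays all zeros, so only pc = 0 matters there. Note that no network is involved anywhere in this step.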