Here is an overview of how YOLO works in our exercise. (Note that there are many different implementations of the YOLO concepts, so it is sometimes difficult to explain YOLO internals…)
As you can see, we split the whole image into a 19x19 grid. Each grid cell is 32x32 pixels in size and has 5 anchors. The anchor shapes differ so that different types of objects can be detected.
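To make the grid concrete, here is a minimal sketch of how a pixel position maps to a grid cell. It assumes the input image is 608x608 pixels (19 cells x 32 pixels, which follows from the numbers above); the function name `cell_for_point` is made up for this illustration and is not part of the exercise code.

```python
GRID_SIZE = 19       # cells per side, as described above
CELL_PIXELS = 32     # pixels per cell side, as described above
IMAGE_SIZE = GRID_SIZE * CELL_PIXELS  # 608, assumed input size


def cell_for_point(x_px, y_px):
    """Return (row, col) of the grid cell containing the pixel (x_px, y_px)."""
    if not (0 <= x_px < IMAGE_SIZE and 0 <= y_px < IMAGE_SIZE):
        raise ValueError("point lies outside the image")
    # Integer division by the cell size gives the cell index on each axis.
    return int(y_px // CELL_PIXELS), int(x_px // CELL_PIXELS)


if __name__ == "__main__":
    print(IMAGE_SIZE)                 # 608
    print(cell_for_point(300, 100))   # (3, 9)
```

An object whose center falls at pixel (300, 100) would therefore be handled by the cell in row 3, column 9, using one of that cell's 5 anchors.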
Each anchor box can hold information for one object. This information consists of the box position and size (x, y, w, h), the objectness (Pc), and the probability distribution over the object classes. With this, we can detect at most 5 objects in one grid cell.
If there is an object, the value of Pc gets closer to 1, and the class probabilities (for car, cat, bicycle, etc.) are set. The box position and size are also set by the neural network. If there is no object, the value of Pc gets closer to 0. (And the box information (x, y, w, h) is 0.)
The key point here is that we have “pre-defined” classes that we want to detect. In our case, 80 classes are defined.
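Putting these numbers together, each anchor carries 4 + 1 + 80 = 85 values. The sketch below only illustrates that layout. The ordering (box first, then Pc, then class probabilities) is an assumption and differs between implementations, and `split_anchor_vector` is a hypothetical helper, not a function from the exercise.

```python
GRID_SIZE = 19
NUM_ANCHORS = 5
NUM_CLASSES = 80
BOX_FIELDS = 4  # x, y, w, h

VALUES_PER_ANCHOR = BOX_FIELDS + 1 + NUM_CLASSES  # 85
OUTPUT_SHAPE = (GRID_SIZE, GRID_SIZE, NUM_ANCHORS, VALUES_PER_ANCHOR)
MAX_BOXES = GRID_SIZE * GRID_SIZE * NUM_ANCHORS   # 1805 candidate boxes


def split_anchor_vector(values):
    """Split one anchor's flat vector into (box, pc, class_probs).

    Assumed order: [x, y, w, h, Pc, p(class 0), ..., p(class 79)].
    """
    if len(values) != VALUES_PER_ANCHOR:
        raise ValueError("expected %d values" % VALUES_PER_ANCHOR)
    box = values[:BOX_FIELDS]
    pc = values[BOX_FIELDS]
    class_probs = values[BOX_FIELDS + 1:]
    return box, pc, class_probs


if __name__ == "__main__":
    empty_anchor = [0.0] * VALUES_PER_ANCHOR  # "nothing here": Pc = 0, box = 0
    box, pc, class_probs = split_anchor_vector(empty_anchor)
    print(OUTPUT_SHAPE, MAX_BOXES, pc, len(class_probs))
```

Note that there is no 81st slot for "background" in this layout: an empty anchor is expressed only through a low Pc.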
[Training phase]
In the training phase, which is not part of our exercise, we use ground-truth images, which include the class information (e.g., the index for car, bicycle, …) and the bounding box information.
The network is trained to detect objects using anchor boxes. The key point here is the loss function, i.e., what should be minimized. The yolo_loss function evaluates the coordinate difference (position and size of a box), the objectness difference (Pc), and the classification difference (class type). With this, the neural network is trained to output the box position and size, the objectness, and the object type.
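To show how the three terms fit together, here is a heavily simplified sketch for a single anchor. This is not the actual yolo_loss function: it uses plain squared errors, and the weights `coord_weight=5.0` and `noobj_weight=0.5` are assumptions borrowed from the original YOLO formulation. The dictionary layout and the name `anchor_loss` are also made up for this illustration.

```python
def squared_error(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def anchor_loss(pred, target, coord_weight=5.0, noobj_weight=0.5):
    """Simplified loss for one anchor.

    pred and target are dicts with keys:
      "box"     -> [x, y, w, h]
      "pc"      -> objectness
      "classes" -> list of class probabilities
    """
    if target["pc"] == 1.0:
        # An object is assigned to this anchor: all three terms count.
        coord = coord_weight * squared_error(pred["box"], target["box"])
        objectness = (pred["pc"] - 1.0) ** 2
        classification = squared_error(pred["classes"], target["classes"])
        return coord + objectness + classification
    # No object: only the objectness is pushed toward 0.
    # The box and class outputs are not penalized at all.
    return noobj_weight * pred["pc"] ** 2


if __name__ == "__main__":
    target = {"box": [0.5, 0.5, 1.0, 2.0], "pc": 1.0, "classes": [1.0, 0.0]}
    pred = {"box": [0.5, 0.5, 1.0, 1.0], "pc": 0.5, "classes": [0.5, 0.5]}
    print(anchor_loss(pred, target))  # 5.75
```

In this sketch the "no object" branch never looks at the class probabilities, which is one way to see why nothing needs to be learned as an explicit "background" class.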
[Inference phase]
The output from the network is a set of bounding box candidates, each with its objectness and class probabilities, from the 19x19 grid. There is no concept of “background”, since it is not one of the pre-defined classes. There is just “nothing” there, i.e., no box information, zero Pc, and some noise in the probability distribution.
By filtering and non-max suppression, we can select the most probable bounding boxes and their class information from the output of the neural network.
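These two steps can be sketched roughly as follows. This is a minimal illustration, not the code from the exercise: it assumes the boxes have already been converted to corner format (x1, y1, x2, y2), it scores each candidate as Pc times its highest class probability, and the thresholds 0.6 and 0.5 are arbitrary example values.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(candidates, score_threshold=0.6, iou_threshold=0.5):
    """candidates: list of (box, pc, class_probs).

    Returns a list of (box, class_id, score) after score filtering
    and per-class non-max suppression.
    """
    scored = []
    for box, pc, class_probs in candidates:
        class_id = max(range(len(class_probs)), key=class_probs.__getitem__)
        score = pc * class_probs[class_id]
        if score >= score_threshold:       # drops "nothing here" anchors
            scored.append((box, class_id, score))

    scored.sort(key=lambda det: det[2], reverse=True)
    kept = []
    for det in scored:
        # Keep a box unless it overlaps a higher-scoring box of the same class.
        if all(det[1] != k[1] or iou(det[0], k[0]) <= iou_threshold
               for k in kept):
            kept.append(det)
    return kept


if __name__ == "__main__":
    candidates = [
        ((10, 10, 110, 110), 0.9, [0.9, 0.1]),    # strong detection, class 0
        ((20, 20, 120, 120), 0.8, [0.9, 0.1]),    # overlapping duplicate
        ((300, 300, 400, 400), 0.05, [0.5, 0.5]), # empty anchor, low Pc
        ((200, 200, 260, 260), 0.9, [0.1, 0.9]),  # separate object, class 1
    ]
    for box, class_id, score in filter_and_nms(candidates):
        print(box, class_id, round(score, 2))
```

In this example the anchor with Pc = 0.05 is removed by the score filter alone, regardless of what its class probabilities say, and the overlapping duplicate is removed by non-max suppression.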
Then, back to your question: “don’t care” is equal to “no object found”, i.e., Pc=0 and the box information is 0. There is no explicit definition of background, since the network only learns the pre-defined classes from the ground truth. Everything else is background.
Hope this helps.