First you need to clarify which object detection algorithm you are asking about. In YOLO, which is normally what is being discussed in this class when we see 19x19 grids, there is no ‘joining’ because each prediction is of a complete object. Predicted bounding boxes are not constrained to fit within a single grid cell. The mechanism for that is covered in (several) existing threads.

Glad that helped. An important takeaway from the equations for the predicted bounding box center location and shape is that the YOLO CNN does not output them directly. Rather, the net outputs values that are cleverly set up to be on the same scale (i.e. order of 1) as the object presence and class confidence/probability values. This lets them all play nicely together in the loss function and be treated as a single overall regression, rather than requiring separate pipelines and models for the classification and regression elements.

The shape equations also show the importance of choosing good anchor boxes, or priors, since they are multiplicative factors in the shape outputs. Mathematically, I guess the shape could be anywhere between 0 and positive infinity pixels. Practically, the lower bound is at least 1, since a bounding box can’t have less than 1 pixel of height and width, and probably really 3 or more, since you’re unlikely to get usable features out of objects any smaller. The upper bound is the size of the input image itself. When establishing the training data, you reverse engineer from the actual ground-truth shapes to the t_i values the network would need to output to produce them; from there, the training process and loss function take over.
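To make the decode/encode relationship concrete, here is a minimal sketch of the YOLOv2-style box equations described above. The function and variable names (decode_box, encode_box, and the convention that c_x, c_y are the grid cell offsets and p_w, p_h the anchor/prior dimensions, all in grid units) are my own choices for illustration, not anything defined in the course materials:

```python
import math

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Turn raw network outputs t_i into a predicted box (YOLOv2-style).

    c_x, c_y: offsets of the grid cell's top-left corner (grid units).
    p_w, p_h: width/height of the anchor box (prior), same units.
    """
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    b_x = c_x + sigmoid(t_x)   # sigmoid keeps the center inside the cell
    b_y = c_y + sigmoid(t_y)
    b_w = p_w * math.exp(t_w)  # anchor is a multiplicative factor; exp > 0
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h

def encode_box(b_x, b_y, b_w, b_h, c_x, c_y, p_w, p_h):
    """Reverse engineer the t_i targets from a ground-truth box."""
    logit = lambda p: math.log(p / (1.0 - p))  # inverse of sigmoid
    return (logit(b_x - c_x), logit(b_y - c_y),
            math.log(b_w / p_w), math.log(b_h / p_h))
```

Note how the exponential makes the width/height strictly positive with no hard upper bound (matching the "0 to infinity" observation), and encode_box is exactly the reverse-engineering step used to build the training targets.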