You can start with Y in the correct shape, initialized to 0. Then overwrite just the locations that have training labels.
It is straightforward to determine which grid cell contains the center of the object. First, determine how many pixels map to each grid cell. Then, from your bounding box label, determine the pixel coordinates of the object center. Dividing the center coordinates by the pixels per cell and rounding down tells you how many grid cells the object is offset from the origin. In the autonomous driving exercise the images are 608x608 and there are 19x19 grid cells, so each grid cell is ‘assigned’ a patch of 32x32 pixels. An object in the center of that image, at pixel (304, 304), would be assigned to the S_x = 9, S_y = 9 grid location (assuming 0-indexing), since 304 / 32 = 9.5. An object with its center less than 32 pixels from the upper left-hand corner would get S_x = 0, S_y = 0.
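The grid cell lookup can be sketched in a few lines of Python. This is only an illustration: the function name `grid_cell` and its parameters are made up here, and it assumes square images, pixel-space centers, and 0-indexed cells.

```python
def grid_cell(center_x, center_y, image_size=608, grid_size=19):
    """Return the 0-indexed (S_x, S_y) grid cell containing a pixel-space center."""
    pixels_per_cell = image_size / grid_size  # 608 / 19 = 32 pixels per cell
    # Floor division gives the cell offset from the origin; the min() keeps a
    # center lying exactly on the far edge inside the last cell.
    s_x = min(int(center_x // pixels_per_cell), grid_size - 1)
    s_y = min(int(center_y // pixels_per_cell), grid_size - 1)
    return s_x, s_y


print(grid_cell(304, 304))  # (9, 9): 304 // 32 = 9
print(grid_cell(10, 20))    # (0, 0): both coordinates are under 32
```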
Anchor box assignment is more complicated. First, you determine good anchor box shapes for the training set using unsupervised learning (typically k-means clustering on the widths and heights of the labelled boxes). Then, you assign each training object to its best anchor based on IOU. Other threads in this forum cover it in more detail.
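As a rough sketch of the assignment step, one common approach compares shapes only: the box and each anchor are treated as if they shared the same center, so the IOU depends only on width and height. The helper names and the anchor values below are hypothetical; real anchor shapes would come from clustering your own training set.

```python
def shape_iou(box_wh, anchor_wh):
    """IOU of two boxes assumed to share a center, so only width and height matter."""
    w, h = box_wh
    aw, ah = anchor_wh
    intersection = min(w, aw) * min(h, ah)
    union = w * h + aw * ah - intersection
    return intersection / union


def best_anchor(box_wh, anchors):
    """Index of the anchor whose shape has the highest IOU with the box."""
    return max(range(len(anchors)), key=lambda i: shape_iou(box_wh, anchors[i]))


# Hypothetical anchor shapes as (width, height) in pixels.
anchors = [(30, 60), (60, 30), (120, 120)]
print(best_anchor((50, 28), anchors))  # 1: the wide, short anchor fits best
```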
You end up with a Y matrix of shape (S,S,B,(1+4+C)) that is fully populated. All entries are zero except at the positions (S_x, S_y, b, …) obtained from your labelled data, where b is the index of the assigned anchor box. Note that since the loss function is basically computing Y - \hat{Y}, the two matrices need to be the same shape.
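Putting the pieces together, a minimal NumPy sketch of building Y might look like the following. Several things here are assumptions for illustration: the values B = 5 and C = 80, the label tuple layout, the function name `build_y`, and storing the box in raw pixels (actual implementations usually encode the box relative to the grid cell and anchor).

```python
import numpy as np

S, B, C = 19, 5, 80   # grid size, anchors per cell, classes (assumed example values)
IMAGE_SIZE = 608


def build_y(labels):
    """labels: iterable of (center_x, center_y, width, height, class_id, anchor_idx)."""
    # Start with the full-shape array of zeros, then overwrite labelled positions.
    y = np.zeros((S, S, B, 1 + 4 + C), dtype=np.float32)
    pixels_per_cell = IMAGE_SIZE / S
    for cx, cy, w, h, class_id, anchor_idx in labels:
        s_x = min(int(cx // pixels_per_cell), S - 1)
        s_y = min(int(cy // pixels_per_cell), S - 1)
        y[s_x, s_y, anchor_idx, 0] = 1.0               # object present
        y[s_x, s_y, anchor_idx, 1:5] = (cx, cy, w, h)  # box, left in pixels (assumption)
        y[s_x, s_y, anchor_idx, 5 + class_id] = 1.0    # one-hot class
    return y


y = build_y([(304, 304, 50, 28, 2, 1)])
print(y.shape)              # (19, 19, 5, 85)
print(np.count_nonzero(y))  # 6: one objectness flag, four box values, one class bit
```

Every cell and anchor without a label stays at zero, so Y has the same shape as the network output no matter how many objects an image contains.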
Some other related threads: