Here is what this looks like in terms of the YOLO v2 model itself. I built the CNN using the 608x608 Berkeley Driving Data image used in the previous thread, a 19x19 grid shape, 8 dimension clusters/anchor boxes, and 1 class (cars only for now). That gives an output of S*S*B*(1+4+C) = 19*19*8*(1+4+1) = 19*19*8*6, where each anchor box carries 1 objectness score, 4 box coordinates, and C class scores.
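As a quick sanity check on that arithmetic, here is a minimal sketch in plain Python. The function name `yolo_output_shape` and its parameter names are my own, made up for illustration; they don't come from the paper or from any library.

```python
def yolo_output_shape(grid_size, num_anchors, num_classes):
    """Return the per-image output shape (S, S, B, 1 + 4 + C)."""
    # 1 objectness score + 4 box coordinates + C class scores per anchor box
    values_per_box = 1 + 4 + num_classes
    return (grid_size, grid_size, num_anchors, values_per_box)


# Settings from this post: 19x19 grid, 8 anchor boxes, 1 class (cars)
shape = yolo_output_shape(grid_size=19, num_anchors=8, num_classes=1)

# Total number of values the network predicts per image
total_values = 1
for dim in shape:
    total_values *= dim

print(shape)         # (19, 19, 8, 6)
print(total_values)  # 17328
```

Adding more classes later only changes the last dimension, e.g. 10 classes would give 19x19x8x15.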
You can see the 608x608x3 input shape in the input layer and the 19x19x8x6 shape in the output layer. The layers (Conv2D, BatchNorm, MaxPool, LeakyReLU, etc.), as well as the filter counts, strides, and padding, are taken right from the YOLO v2 paper, including the skip connection between conv2d_13 and conv2d_20 (not shown in this excerpt).
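For anyone wondering where the 19x19 grid comes from, it falls out of the pooling layers. The sketch below assumes the model keeps the paper's five 2x2, stride-2 max-pooling layers and that the convolutions use "same" padding so they don't change the spatial size; the helper name `grid_size_after_pooling` is hypothetical.

```python
def grid_size_after_pooling(input_size, num_pool_layers=5, pool_stride=2):
    """Spatial size after repeated stride-2 pooling (assumes 'same'-padded convs)."""
    size = input_size
    for _ in range(num_pool_layers):
        if size % pool_stride != 0:
            raise ValueError(f"{size} is not divisible by {pool_stride}")
        size //= pool_stride  # each max-pool halves height and width
    return size


grid = grid_size_after_pooling(608)
print(grid)  # 19  (608 -> 304 -> 152 -> 76 -> 38 -> 19)

# One pooling step earlier the feature map is 38x38. If the skip connection
# follows the paper's passthrough layer, that is the resolution it taps.
skip_size = grid_size_after_pooling(608, num_pool_layers=4)
print(skip_size)  # 38
```

This is also why the input size needs to be a multiple of 32: five halvings divide each side by 2^5.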