Course 4 Week 3 Assignment, Ex. 1: Filtering with a threshold on class scores


How exactly is a (19,19,5,85) tensor mathematically equivalent to (19,19,425), and how is that equivalent to 3 variables with shapes box_confidence: (19,19,5,1), boxes: (19,19,5,4), and box_class_probs: (19,19,5,80)? If it is broadcasting, it should only get broadcast to the box_class_probs shape.

@paulinpaloalto @ai_curious

1 is for the predicted object presence probability p_c
4 is for the predicted bounding box center location (b_x, b_y) and shape (b_w,b_h)
80 is for the number of classes in the MS COCO (Common Objects in Context) dataset

19*19 is the number of grid cells
5 is the number of anchor boxes

85 = (1 + 4 + 80)
425 = 5 * 85

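The total number of values in the training input and in the network output is 19 * 19 * 5 * (1 + 4 + 80) = 153,425, which you are free to stack or flatten into any shape that is convenient. That is the sense in which the shapes are equivalent: reshaping only regroups the same values, and no broadcasting is involved.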
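Here is a minimal NumPy sketch of that regrouping. The array is a made-up stand-in for the network output (just consecutive integers), and the variable names are mine, not the assignment's. It only shows that (19,19,425) and (19,19,5,85) hold the same values in the same order.

```python
import numpy as np

GRID, ANCHORS, CLASSES = 19, 5, 80
PER_BOX = 1 + 4 + CLASSES  # 85 values per anchor box

# Hypothetical stand-in for the network output: one distinct integer per value
flat = np.arange(GRID * GRID * ANCHORS * PER_BOX)

out_3d = flat.reshape(GRID, GRID, ANCHORS * PER_BOX)   # (19, 19, 425)
out_4d = out_3d.reshape(GRID, GRID, ANCHORS, PER_BOX)  # (19, 19, 5, 85)

total_values = out_4d.size  # 19 * 19 * 5 * 85

# Reshaping does not change the values or their order
same_data = np.array_equal(out_3d.ravel(), out_4d.ravel())

# Anchor 2 of grid cell (0, 0) in the 4D view is the third run of 85 values
# behind that cell in the 3D view
same_box = np.array_equal(out_4d[0, 0, 2],
                          out_3d[0, 0, 2 * PER_BOX:3 * PER_BOX])

print(out_3d.shape, out_4d.shape, total_values, same_data, same_box)
```

Note that this assumes the 425 values behind each grid cell are stored anchor by anchor, which is what a plain reshape gives you.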

Take a look at this thread: "Detecting Multiple Objects using YOLO - Grid Cells plus Anchor Boxes".

The slicing is basically stripping off some or all of one of the layers of the 4D object, depending on what you need to do with the data. (NOTE: The diagrams in that post use 3 for the number of grid cells, because that is simpler to draw, but the idea is the same.) If you just want to work with the presence probability, the bounding box shape, or the class prediction, you can use Python slicing to extract those elements from the larger object. If you pull out just the box shapes, you will have a (19,19,5,4) object. If you pull out just the class vector, it will be (19,19,5,80). If you want to manipulate everything behind the grid cells in one flattened vector, it will be (19,19,425), and so on.
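Something like the following shows the slicing in NumPy. Treat it as a sketch: the data is random, and I am assuming the 85 values are ordered as [p_c, b_x, b_y, b_w, b_h, c_1 ... c_80], following the 1 + 4 + 80 breakdown above. The real model output may order them differently, so check the assignment code before relying on these indices.

```python
import numpy as np

rng = np.random.default_rng(0)
output = rng.random((19, 19, 5, 85))  # random stand-in for the 4D output

# ASSUMED layout of the last axis: [p_c, b_x, b_y, b_w, b_h, c_1 ... c_80]
box_confidence = output[..., 0:1]   # (19, 19, 5, 1); 0:1 keeps the last axis
boxes = output[..., 1:5]            # (19, 19, 5, 4)
box_class_probs = output[..., 5:]   # (19, 19, 5, 80)

# Everything behind each grid cell as one flattened vector
flattened = output.reshape(19, 19, 425)

# Putting the three slices back together recovers the original object
rebuilt = np.concatenate([box_confidence, boxes, box_class_probs], axis=-1)

# Broadcasting only shows up if you multiply the slices, for example:
# (19, 19, 5, 1) * (19, 19, 5, 80) -> (19, 19, 5, 80)
box_scores = box_confidence * box_class_probs

print(box_confidence.shape, boxes.shape, box_class_probs.shape)
print(flattened.shape, box_scores.shape)
```

So the three variables are just views onto different parts of the same 85-long vector per anchor box. The product at the end is probably the broadcasting the question has in mind, and it is a separate step from the reshape.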
