So I have successfully completed this course, but I must admit I still have some conceptual questions I am having a hard time wrapping my mind around. I may update these questions as I get closer to an answer, but here goes:
- In YOLO, I understand some of the really big advantages are: a) no more need for a ‘sliding window’, and b) many of the computations are now shared, or let’s say ‘no longer duplicated’.
However (even with something small like the 19 x 19 grid presented in class), I am not clear on how the algorithm manages to find the ‘center’ of the bounding box.
I mean, maybe you could find a subset of similar or associated regions, say ‘okay, these are our high-probability regions’, and then take the X,Y min/max from those to find the center.
But (at least apparently) I don’t see that in our calculations, so I am wondering how on Earth the algorithm pulls that off.
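The closest I have come is the label encoding below, so let me sketch my current (possibly wrong) understanding and someone can tell me where it breaks down. This is just my own toy rewrite in plain Python, not the actual course code, and the grid size and box values are made up:

```python
GRID = 19  # grid cells per side, as in the lecture example

def encode_center(x_center, y_center, w, h):
    """Encode one ground-truth box (all values normalized to [0, 1] of the image).

    My understanding: the cell containing the box center is made 'responsible'
    for the object, and the regression target is the center's offset *within*
    that cell, not an absolute position found by scanning regions.
    (The lecture expresses w, h relative to the cell, so they can exceed 1;
    I'm leaving them relative to the whole image here just to keep this short.)
    """
    col = int(x_center * GRID)      # which column of cells the center falls into
    row = int(y_center * GRID)      # which row of cells the center falls into
    x_off = x_center * GRID - col   # offset inside that cell, in [0, 1)
    y_off = y_center * GRID - row
    return row, col, (x_off, y_off, w, h)

# A made-up box centered at (0.53, 0.41), covering 20% x 30% of the image.
print(encode_center(0.53, 0.41, 0.20, 0.30))
# -> (7, 10, (0.07..., 0.79..., 0.2, 0.3)) if I have done the arithmetic right
```

If that is roughly right, then the ‘center’ is never searched for at all, it is just a regression target tied to one cell, but I would love confirmation.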
- I understand Neural Nets have been found to do some amazing things, but if, in YOLO, we are only looking at a bunch of tiny boxes at a time… how can it possibly find an association? And, as Prof. Ng admits and most of us know, humans are way better at this.
But, taking an image that is not from the class: as a person, if you showed me just this:
I would have no idea what that was, but the image it is extracted from is exactly this:
So it is not exactly clear to me how this is happening. Personally, despite our innate ability, I feel our visual recognition comes entirely from context. We see the ‘entire scene’ and then decide, ‘does this make sense being there?’
Maybe YOLO is doing the same, but we humans are certainly not looking at little boxes one at a time.
Like, the first time you see Guernica, the canvas is just so big in person, you only see it ‘all at once’.
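To put a number on what each ‘little box’ actually sees, here is my back-of-the-envelope receptive-field calculation for a generic stack of conv/pool layers (made-up layer settings, not the actual YOLO backbone). If my understanding is right, each grid cell’s prediction is computed from a patch far bigger than the cell itself, which might be exactly the ‘context’ I am asking about:

```python
# Receptive field of one output position after a stack of layers, using the
# standard recurrence: rf_new = rf + (kernel - 1) * jump, jump_new = jump * stride.
layers = [
    # (kernel, stride) pairs; made-up settings, not the real YOLO backbone
    (3, 1), (2, 2),   # 3x3 conv, then 2x2 max-pool
    (3, 1), (2, 2),
    (3, 1), (2, 2),
    (3, 1), (2, 2),
    (3, 1), (2, 2),
]

rf, jump = 1, 1
for kernel, stride in layers:
    rf += (kernel - 1) * jump
    jump *= stride

# After five downsamplings each output cell "owns" a jump x jump patch (32 x 32 here),
# but its value is computed from an rf x rf patch of the input (94 x 94 here).
print(rf, jump)
```

So maybe the ‘little boxes’ are only little on the output side, and the network is seeing much more than each box. Is that the right way to think about it?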
- I am still trying to do my own reading on this, but I find the 1 x 1 convolution confusing. Is this a convolutional version of something that is ‘fully connected’?
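To pin down what I mean, here is a tiny NumPy check of my current mental model (the shapes and values are made up, and this is only my reading of it, not anything from the course code):

```python
import numpy as np

# Toy activation volume: height x width x channels (values are arbitrary).
H, W, C_in, C_out = 4, 4, 3, 5
x = np.random.randn(H, W, C_in)

# A 1 x 1 convolution is just a (C_in, C_out) weight matrix plus a bias,
# applied independently at every spatial position.
W1 = np.random.randn(C_in, C_out)
b1 = np.random.randn(C_out)

# "Convolution" view: slide the 1 x 1 filter over every (i, j) position.
conv_out = np.zeros((H, W, C_out))
for i in range(H):
    for j in range(W):
        conv_out[i, j] = x[i, j] @ W1 + b1

# "Fully connected" view: feed each pixel's channel vector through the same
# dense layer, shared across all positions.
fc_out = (x.reshape(-1, C_in) @ W1 + b1).reshape(H, W, C_out)

print(np.allclose(conv_out, fc_out))  # True, if my mental model is right
```

If those two really are the same operation, then a 1 x 1 convolution is just a fully connected layer over the channels, shared across positions. Is that the correct reading?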
- And I’ll just leave off on this for the moment: in the ‘upscale’ stage of U-Nets, I can only imagine the upsampling stages crucially depend on the skip connections carrying activations over from the ‘downscale’ side, correct? Otherwise, how on Earth does it know what to train toward as we head back up?
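Here is the shape of what I think is going on in one ‘up’ step (a minimal Keras-style sketch with made-up filter counts and sizes, not the assignment’s exact architecture), so someone can correct me if the skip concatenation is not where the information comes from:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-ins for one decoder step (filter counts and sizes are made up).
skip = tf.random.normal((1, 64, 64, 128))   # activation saved on the downscale path
x    = tf.random.normal((1, 32, 32, 256))   # current decoder activation, one level deeper

# Upscale, then concatenate the saved activation so the following conv sees
# both the coarse "what" from below and the fine "where" from the skip.
up     = layers.Conv2DTranspose(128, kernel_size=2, strides=2, padding="same")(x)
merged = layers.Concatenate(axis=-1)([up, skip])   # -> (1, 64, 64, 256)
out    = layers.Conv2D(128, kernel_size=3, padding="same", activation="relu")(merged)

print(out.shape)   # (1, 64, 64, 128)
```

My guess is that, without the concatenated skip activation, the decoder would have to reinvent all the spatial detail that pooling threw away, but I would appreciate a sanity check on that.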
*Oops, forgot one thing: with regard to ‘anchor boxes’, we were kind of just ‘handed’ them in the assignment, but in practice, is this something you design manually (like a feature, or as you would your training set)? Or is this something an algorithm can decide?
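One possibility I have run across (I believe YOLOv2 does something along these lines, so take this as my guess at what ‘an algorithm can decide’ could mean): cluster the widths and heights of the training-set boxes and use the cluster centers as anchors. A rough sketch with plain Euclidean k-means (the paper, as I understand it, uses a 1 - IoU distance instead):

```python
import numpy as np

def kmeans_anchors(box_wh, k=5, n_iter=50, seed=0):
    """Cluster (width, height) pairs of training boxes; the centers become anchors."""
    rng = np.random.default_rng(seed)
    centers = box_wh[rng.choice(len(box_wh), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each box to its nearest center. Plain Euclidean distance here;
        # the YOLOv2 paper uses 1 - IoU instead, which I am skipping for brevity.
        d = np.linalg.norm(box_wh[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of the boxes assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = box_wh[labels == j].mean(axis=0)
    return centers

# Made-up (w, h) pairs, normalized to the image size.
boxes = np.random.rand(500, 2)
print(kmeans_anchors(boxes, k=5))
```

Is that roughly how the anchors we were handed would have been chosen, or are they more commonly just hand-picked shapes (tall/thin for pedestrians, wide/flat for cars, and so on)?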