Since there are many YOLO derivatives, it is hard to talk about “implementations” in general. But I think the essentials are covered by @ai_curious.
To better understand “what is object recognition?”, it may help to start from a bird’s-eye view.
Object recognition is a general term that covers three tasks:
- Localization: identifies the location of an object with a bounding box
- Classification: predicts the type or class of an object
- Object Detection: combines localization and classification for all objects in an image
YOLO performs localization and classification at the same time, using a loss function that combines terms for both. So at training time, as described above, YOLO needs a set of images with annotations that specify 1) the class of each object in the image, and 2) the position and size of its bounding box. Accordingly, the network outputs 1) localization information (bounding box coordinates), and 2) classification information (like car, traffic light, …). One important addition is a “confidence” score, which indicates how likely a box is to contain an object and how well it covers that object. It is used later for box suppression.
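To make the role of “confidence” in box suppression a bit more concrete, here is a minimal sketch of greedy non-maximum suppression. The `(x1, y1, x2, y2)` box format, the function names `iou` and `nms`, and the 0.5 threshold are my own assumptions for illustration, not taken from any particular YOLO implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, confidence) pairs.

    Boxes are visited from highest to lowest confidence; a box is kept
    only if it does not overlap too much with any box already kept.
    """
    ordered = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for det in ordered:
        if all(iou(det[0], k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


# Hypothetical detections: two heavily overlapping boxes and one separate box.
detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),
    ((20, 20, 30, 30), 0.7),
]
kept = nms(detections)
```

In this toy input, the second box overlaps the first one heavily and has lower confidence, so it is suppressed, while the third box does not overlap anything and survives.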
In the first two releases, YOLO treated classification as single-label (one class per prediction), with a softmax over the class scores in YOLO v2.
- YOLO v1: You Only Look Once: Unified, Real-Time Object Detection
- YOLO v2: YOLO9000: Better, Faster, Stronger
This changed in v3, for the following reasons. Both are quoted from the YOLO v3 paper, “YOLOv3: An Incremental Improvement”:
- We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers.
- Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.
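The difference between the two quoted approaches can be sketched with a few lines of plain Python. The logit values and the class names in the comment are made up for illustration. The point is only that softmax forces the classes to compete for a total probability of 1, while independent logistic (sigmoid) classifiers score each class on its own.

```python
import math


def softmax(logits):
    """Softmax over a list of logits (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function, applied to one class logit at a time."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logits for three classes, e.g. "person", "woman", "car".
# The first two labels overlap, so both should be allowed to score high.
logits = [3.0, 2.5, -4.0]

soft = softmax(logits)                  # scores sum to 1, classes compete
indep = [sigmoid(x) for x in logits]    # each score is independent in (0, 1)
```

With softmax, the two overlapping classes have to share the probability mass, so the second one ends up below 0.5. With independent sigmoids, both can be close to 1 at the same time, which is what a multilabel setting needs.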
So you actually raised good points.
Since there are many derivatives, some may still use softmax, but the basic YOLO concept is as described above.
I think that also covers the second part of your question.
Regarding “unseen data”: such objects are simply treated as background. YOLO is trained on the given class definitions and the annotations on the images, so if an object's class is not in the class list, the object is considered background.
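As a rough sketch of what “treated as background” means at inference time, the snippet below filters predictions by a combined score. Multiplying objectness by the per-class probability, the tuple layout of `predictions`, the name `filter_detections`, and the 0.5 threshold are all assumptions for illustration. Real implementations differ in the details.

```python
def filter_detections(predictions, class_names, score_threshold=0.5):
    """Keep only (box, class name, score) triples above the threshold.

    predictions: list of (box, objectness, class_probs) tuples, where
    class_probs has one entry per name in class_names.
    """
    results = []
    for box, objectness, class_probs in predictions:
        for name, prob in zip(class_names, class_probs):
            score = objectness * prob
            if score >= score_threshold:
                results.append((box, name, score))
    return results


class_names = ["car", "traffic light"]  # the only classes the model knows

# Hypothetical network outputs: one confident car, and one region showing
# something outside the class list, which gets low scores everywhere.
predictions = [
    ((0, 0, 10, 10), 0.95, [0.9, 0.05]),
    ((5, 5, 8, 8), 0.10, [0.3, 0.2]),
]
results = filter_detections(predictions, class_names)
```

The second region never reaches the threshold for any known class, so it produces no detection at all, which is the same outcome as for plain background.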
Hope this clarifies.