In our exercise, training YOLO is not included, since it is typically time consuming.
I’m curious what yolo’s training set looks like?
YOLO comes in several versions that take slightly different inputs. Basically, though, there is a YOLO format which consists of a “class index” and “bounding box information”. Each image gets a text file with the same base name as the image file. In parallel, we keep a list of classes like “car”, “bicycle”, … so that each “class index” can be looked up.
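To make the format concrete, here is a minimal sketch of how one label line can be produced. It assumes the commonly used Darknet-style layout, where each line is "class_index x_center y_center width height" with the box values normalized by the image size. The function name, the class list and the box values are made up for illustration.

```python
def to_yolo_line(class_index, box, image_size):
    """Convert a pixel-space box (xmin, ymin, xmax, ymax) into one YOLO label line."""
    xmin, ymin, xmax, ymax = box
    img_w, img_h = image_size
    # YOLO stores the box center and size, each divided by the image dimensions.
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_index} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical class list; the position in this list is the "class index".
classes = ["car", "bicycle"]

# One bicycle in a 640x480 image, given in pixel coordinates.
line = to_yolo_line(classes.index("bicycle"), (100, 200, 300, 400), (640, 480))
print(line)
```

An image called photo.jpg would then have a photo.txt next to it with one such line per object.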
Do I need to do the labeling manually like u-net?
Basically, we draw a bounding box on an image and annotate it, i.e., put the “class index” in front of the box information.
labelImg supports the YOLO format directly. Microsoft VoTT (Visual Object Tagging Tool) does not support the YOLO format, but it generates Pascal VOC (XML). That can be converted into the YOLO format, and some derivatives of YOLO support Pascal VOC directly.
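The conversion from Pascal VOC to YOLO is not difficult. Below is a rough sketch that uses only the Python standard library. It assumes the usual VOC layout (a size element with width/height, and object elements with name and bndbox) and the normalized "class x_center y_center width height" YOLO line. The function name and the sample XML are made up for illustration, so please check them against what your tool actually exports.

```python
import xml.etree.ElementTree as ET


def voc_to_yolo(xml_text, classes):
    """Turn one Pascal VOC annotation (as a string) into a list of YOLO label lines."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)

    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in classes:
            continue  # skip labels that are not in our class list
        bndbox = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(bndbox.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # Same normalization as the YOLO format: center and size divided by image size.
        x_center = (xmin + xmax) / 2.0 / img_w
        y_center = (ymin + ymax) / 2.0 / img_h
        width = (xmax - xmin) / img_w
        height = (ymax - ymin) / img_h
        lines.append(
            f"{classes.index(name)} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
        )
    return lines


# Made-up annotation for a 640x480 image with a single car.
SAMPLE_XML = """<annotation>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>160</xmin><ymin>120</ymin><xmax>480</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>"""

yolo_lines = voc_to_yolo(SAMPLE_XML, ["car", "bicycle"])
print("\n".join(yolo_lines))
```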
Such as training wine glasses, smoke shaped borders. What tools do I need to use, do I manually make the borders first?
So, the first step is to create annotation files using the tools above.
Then, you need to decide which version of YOLO to use. Don’t use V2.
There have been lots of complaints about it from members of this community. The original YOLO is written in C. The original author stopped developing it, but several researchers and developers have kept enhancing it. V3 has been ported to Keras, and V5 is a PyTorch implementation.
And then go to transfer learning?
Yes, you should start with transfer learning, since training a model takes a significant amount of time. Basically, YOLO consists of the following 3 parts.
- YOLO backbone
- YOLO neck
- YOLO head
The YOLO backbone is a relatively large convolutional network that extracts the features used to detect objects. The original backbone network is called “Darknet”. Some recent research projects replace it with MobileNet or other networks.
The YOLO neck is a so-called “Feature Pyramid Network”, which extracts objects (boxes) from different layers. By default, it has 3 layers. Outputs are taken from different layers and merged with the upscaled output from the last layer. This is another set of convolutional layers.
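The "upscale and merge" step can be sketched in plain Python as follows. This is a toy illustration only: feature maps are nested lists in [channel][row][col] order, the upscaling is nearest-neighbour, and the merge is a channel concatenation, which is how I understand V3 does it. A real neck also has convolutions before and after this step, and all names here are made up.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upscaling of a feature map stored as [channel][row][col]."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat every column twice
            rows.append(wide)
            rows.append(list(wide))  # repeat every row twice
        out.append(rows)
    return out


def merge(deep, shallow):
    """Upscale the deeper (smaller) map and concatenate it with the shallower one."""
    up = upsample2x(deep)
    if len(up[0]) != len(shallow[0]) or len(up[0][0]) != len(shallow[0][0]):
        raise ValueError("spatial sizes must match after upscaling")
    return up + shallow  # list concatenation = concatenation along the channel axis


# Toy data: a 1-channel 2x2 map from the last layer and a 1-channel 4x4 map from an earlier layer.
deep = [[[1, 2], [3, 4]]]
shallow = [[[0] * 4 for _ in range(4)]]

merged = merge(deep, shallow)
print(len(merged), len(merged[0]), len(merged[0][0]))
```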
The YOLO head selects anchor boxes with non-max suppression and finalizes the confidence level and class.
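Non-max suppression itself is simple. Here is a minimal sketch of the usual greedy version: keep the highest-scoring box, drop every other box that overlaps it too much, and repeat. Boxes are (xmin, ymin, xmax, ymax) tuples, and the 0.5 threshold and the sample boxes are just example values. Real implementations normally run this per class and on tensors.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]

kept = non_max_suppression(boxes, scores)
print(kept)
```

Here the second box overlaps the first one with an IoU of about 0.68, so it is suppressed and only the first and third boxes remain.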
There are also several options for transfer learning, depending on which of the parts above you want to retrain. The important point is that the pretrained model is trained for 80 classes. If you do not change the number of classes, you have multiple options, such as ‘load weights for the backbone network only’, ‘fine-tune after loading all weights’, and so on. Even if you want to change the number of classes, you can still load the “weights” for the YOLO backbone and neck.
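Why the number of classes matters can be seen from the shape of the final output layer. The sketch below assumes the V3 convention of 3 anchors per scale, where each anchor predicts 4 box values, 1 objectness score and one score per class. The layer names and shapes are hypothetical and only serve to show which pretrained weights still fit a model with a different number of classes.

```python
def head_channels(num_classes, anchors_per_scale=3):
    """Output channels of a detection layer: per anchor, 4 box values + 1 objectness + class scores."""
    return anchors_per_scale * (5 + num_classes)


def transferable_layers(pretrained_shapes, new_shapes):
    """Names of layers whose weight shape is identical in both models, so the weights can be copied."""
    return [
        name
        for name, shape in pretrained_shapes.items()
        if new_shapes.get(name) == shape
    ]


# Hypothetical layer names and shapes for a model pretrained on 80 classes.
pretrained = {
    "backbone.conv1": (3, 3, 3, 32),
    "neck.conv1": (1, 1, 512, 256),
    "head.out": (1, 1, 256, head_channels(80)),
}

# The same architecture rebuilt for 2 classes: only the output layer changes shape.
new_model = {
    "backbone.conv1": (3, 3, 3, 32),
    "neck.conv1": (1, 1, 512, 256),
    "head.out": (1, 1, 256, head_channels(2)),
}

loadable = transferable_layers(pretrained, new_model)
print(head_channels(80), head_channels(2), loadable)
```

The backbone and neck shapes do not depend on the number of classes, so their weights can be reused. Only the output layer has to be trained from new initial values.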
Of course, there is also the option of training from scratch. (I think it is quite difficult to get that to converge.)
I actually ported V3 to my latest TensorFlow/Keras environment, and it works fine. The algorithm itself is not so complex, but training takes time. If you just want to try it out, the Keras version or the PyTorch version should be handy. Do not go with V2.
Hope this helps.