In the object detection introduction, Andrew says we first need to train the model on classification and then on sliding windows object detection.
- I am assuming we should do the same with YOLO. Is that correct?
- Can we just crop the objects from the object detection dataset and use the crops for classification?
I think what he means is that in object detection you first have to find the zones of interest in the image and then classify the objects in those zones. The YOLO model does that by itself, so there is no need for you to determine the zones of interest beforehand.
Can you crop the objects, as you say, and then use them for classification? As far as I remember, all the algorithms presented in that course do that by themselves, so there is no need for you to do it manually.
This is because pre-YOLO approaches treated classification and localization as separate learning tasks: one classification, one regression. YOLO treats them both as regression, so both can be accomplished in the same (single) forward pass. Notice, however, that while this works well for runtime predictions, it complicates training, and at least the early versions of YOLO approached the problem using transfer learning. That is, they trained first on classification, then modified the head of the network and trained further for localization.
From the YOLO9000 / v2 paper…
Training for classification. We train the network on the standard ImageNet 1000 class classification dataset for 160 epochs using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005 and momentum of 0.9 using the Darknet neural network framework [13].
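As an aside on the "polynomial rate decay with a power of 4" in that quote, here is a minimal sketch of what such a schedule commonly looks like. The function name, the `step` / `total_steps` parameters, and the exact formula (base rate times one minus training progress, raised to the power) are my assumptions for illustration; the paper excerpt does not spell the formula out.

```python
def poly_learning_rate(step: int, total_steps: int,
                       base_lr: float = 0.1, power: float = 4.0) -> float:
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power.

    base_lr=0.1 and power=4 come from the quoted paper text; the formula
    itself and the step-based parameterization are assumptions.
    """
    # Clamp progress to [0, 1] so out-of-range steps do not produce odd values.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * (1.0 - progress) ** power


# The rate starts at base_lr and falls to zero at the end of training.
for step in (0, 25, 50, 75, 100):
    print(step, poly_learning_rate(step, total_steps=100))
```

With a power of 4 the rate drops quickly early on: halfway through training it is already at 1/16 of the starting value.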
Training for detection. We modify this network for detection by removing the last convolutional layer and instead adding on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box so 125 filters.
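The 125 in that quote is just arithmetic: each predicted box carries 5 coordinates plus one score per class, and there are 5 boxes. A small sketch to make that explicit (the function and parameter names are mine, not from the paper or Darknet):

```python
def detection_head_filters(num_boxes: int, num_classes: int,
                           coords_per_box: int = 5) -> int:
    """Number of output filters in the final 1 x 1 convolutional layer.

    Each box predicts `coords_per_box` values plus one score per class,
    following the paper's description for VOC.
    """
    return num_boxes * (coords_per_box + num_classes)


# VOC setting from the quote: 5 boxes, 5 coordinates each, 20 classes.
voc_filters = detection_head_filters(num_boxes=5, num_classes=20)
print(voc_filters)  # 125
```

This is also why the head has to be replaced after classification pretraining: the 1000-way ImageNet output has the wrong shape for detection, so the last layer is swapped for one with this many filters.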
At runtime it uses the fully trained final classification + localization network architecture.
There is a reason almost all of the YOLO-related papers or blogs you find on the web include ‘…we started with a pretrained model…’