I have a rather theoretical question. I’m thinking about applying the YOLO algorithm to a dataset of images of cancer cells. I have 100 images, and each image has bx and by coordinates (the midpoint of the cell) which I want to detect. I also have a class label along with the midpoint data. However, I do not have data on the bounding boxes (bw and bh). I’m wondering if I could use a pre-trained YOLO model (freeze the top layers) and fine-tune it on my own dataset for this task? I’m not sure if it’s possible because I don’t have bw and bh.
In general, the answer to this question is ‘yes’. In fact, the original YOLO authors first trained their network to do classification, then added object localization later, in effect fine-tuning on top of the classification layers. However, I see two possible challenges in trying to do this in your problem space.
First, transfer learning effectiveness depends at least partially on the coherence of the content. I am doubtful a YOLO model trained on COCO objects is going to perform well on medical images: it won’t have been exposed to the correct classes, but even more importantly the features just seem too different. Maybe there is similar training data out in the world (see NIH Chest X-ray dataset | Cloud Healthcare API | Google Cloud, for example), but is there a YOLO model trained on it? I don’t know.

Second, bounding box prediction is a key element of what YOLO is all about. Without bounding box training data, you can’t make useful bounding box predictions. Object width and height will still be part of the loss function and still be part of the network output, but there won’t be any inputs on which to base those values. And if you can’t make useful predictions, why choose YOLO?
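To be concrete, the localization terms of the original YOLO loss (Redmon et al., YOLOv1) look roughly like this; the ground-truth $w_i$ and $h_i$ in the second sum are exactly the values you don’t have:

$$
\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big]
+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\Big]
$$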
I suggest you need to start further upstream with the business/medical objective. Are you just trying to classify an image? Are you trying to segment the image? Trying to measure the size of an anomaly? Is runtime throughput essential? That will drive the algorithm preference.
Remember that YOLO was designed to be competitive in accuracy while optimizing for runtime throughput. It trades off training-time computation and complexity to achieve single-pass prediction at high frame rates. If you don’t need that kind of runtime characteristic (and I don’t know of any medical imaging scenario that does), you could choose a different object detection algorithm that might prove simpler to work with than YOLO. Note that if object extent (not just center position) is part of your desired output, then you still need to provide an objective function and data upon which learning optimization can be performed.
I welcome further thoughts and discussion.
ps: I think 100 images is an extremely small dataset on which to try to train YOLO, which has millions of trainable parameters.
@ai_curious thanks for your answer. Yes, you made me aware that perhaps using a different object detection algorithm might be the way to go. Since I’m just starting out in computer vision, YOLO is the first detection algorithm I have been exposed to, but I will start looking into others for my use case.
Let me describe my scenario better, and perhaps you could recommend something that can achieve this task. I have 100 images, as previously said, and each of those images contains many cells that have been annotated by their center position only; together with the x, y coordinates of the center position there is an associated class label for the cell type. So in summary, one image may contain lots of different class labels, which made me think a classical CNN classification architecture might not be the best fit, since all the examples I have been exposed to so far have only one class label per image.
In total, across the 100 images, I have around 23k class labels and their respective center positions. So in some sense the dataset gets larger when you consider that there are many annotated cells within a single image.
What I was wondering is whether there is any algorithm out there that lets me detect the center position of a cell (a prediction within a buffer of, say, 10 pixels around the true center would still count as a valid detection) and at the same time also gives me a class label.
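Just to pin down what I mean by a “valid detection”, here is a rough sketch (the function and variable names are made up) of how I would score predicted centers against my annotations under that 10-pixel tolerance:

```python
import numpy as np

def match_detections(pred_xy, pred_labels, gt_xy, gt_labels, tol=10.0):
    """Greedy matching of predicted centers to ground-truth centers.

    A prediction counts as a true positive if it lies within `tol` pixels
    of an unmatched ground-truth center and has the same class label.
    """
    gt_used = np.zeros(len(gt_xy), dtype=bool)
    tp = 0
    for p, pl in zip(np.asarray(pred_xy, dtype=float), pred_labels):
        dists = np.linalg.norm(np.asarray(gt_xy, dtype=float) - p, axis=1)
        dists[gt_used] = np.inf              # each ground-truth center matched at most once
        j = int(np.argmin(dists))
        if dists[j] <= tol and gt_labels[j] == pl:
            gt_used[j] = True
            tp += 1
    fp = len(pred_xy) - tp                   # predictions with no match
    fn = len(gt_xy) - tp                     # annotated cells that were missed
    return tp, fp, fn
```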
I’ll continue to think about this, but unless you are confident a priori that each cell is of uniform size, it occurs to me that you might benefit from a two-phase, kind of hybrid process. The first phase uses a form of image segmentation to expand from the center coordinates to the surrounding regions that are not background. This could be done with thresholding or region-merge approaches. Once obtained, those regions could then serve as bounding boxes for a second, supervised object detection phase. A rough sketch of the first phase is below.
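Something along these lines, purely as a sketch and assuming a grayscale image in which cells are brighter than the background (flip the threshold otherwise); the function name is invented, and touching cells that merge into one region would need something fancier like a watershed seeded at your centers:

```python
import cv2
import numpy as np

def derive_boxes_from_centers(gray_img, centers, min_area=20):
    """Derive pseudo ground-truth boxes from center annotations via thresholding."""
    # Otsu threshold to separate foreground (cells) from background
    _, mask = cv2.threshold(gray_img, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Label connected foreground regions and collect their stats (incl. bounding boxes)
    _, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)

    boxes = []
    for (cx, cy) in centers:
        lab = labels[int(round(cy)), int(round(cx))]     # note: row = y, col = x
        if lab == 0 or stats[lab, cv2.CC_STAT_AREA] < min_area:
            boxes.append(None)                           # center fell on background / noise
            continue
        x, y = stats[lab, cv2.CC_STAT_LEFT], stats[lab, cv2.CC_STAT_TOP]
        w, h = stats[lab, cv2.CC_STAT_WIDTH], stats[lab, cv2.CC_STAT_HEIGHT]
        boxes.append((x, y, w, h))
    return boxes
```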
It still isn’t clear to me what the business or medical objective is. Do you need to predict center location, label, and shape (size)? If you just need center and label, you can use the data you have, but you have to change the network output shape. By default, the YOLO output is of shape (S * S * B * (1 + 4 + C)), where the 4 comes from b_x, b_y, b_w, b_h; if you don’t care about shape you could drop b_w and b_h altogether, leaving (S * S * B * (1 + 2 + C)). You’d need a corresponding change to omit the width/height terms from the loss function.
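Just to illustrate the idea (not a drop-in for any particular YOLO codebase; a hedged PyTorch sketch with B = 1 and invented names and constants), the head then predicts objectness + center offsets + class scores per grid cell, and the loss simply has no width/height terms:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

S, B, C = 7, 1, 5   # grid size, boxes per cell, number of cell-type classes (made-up values)

class CenterOnlyHead(nn.Module):
    """Predicts (objectness, x-offset, y-offset, C class scores) per grid cell: 1 + 2 + C values."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, B * (1 + 2 + C), kernel_size=1)

    def forward(self, feats):              # feats: (N, in_channels, S, S) from the backbone
        out = self.conv(feats)             # (N, B*(1+2+C), S, S)
        return out.permute(0, 2, 3, 1)     # (N, S, S, B*(1+2+C))

def center_only_loss(pred, target, lambda_coord=5.0, lambda_noobj=0.5):
    """Target has the same per-cell layout: [objectness, x, y, one-hot class].

    Assumes every image contains at least one annotated cell (true for this dataset).
    """
    obj = target[..., 0] == 1              # grid cells that contain an annotated center
    noobj = ~obj

    obj_loss = F.binary_cross_entropy_with_logits(pred[..., 0][obj], target[..., 0][obj])
    noobj_loss = F.binary_cross_entropy_with_logits(pred[..., 0][noobj], target[..., 0][noobj])
    xy_loss = F.mse_loss(torch.sigmoid(pred[..., 1:3])[obj], target[..., 1:3][obj])   # no w/h terms
    cls_loss = F.mse_loss(pred[..., 3:][obj], target[..., 3:][obj])

    return lambda_coord * xy_loss + obj_loss + lambda_noobj * noobj_loss + cls_loss
```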