Landmark detection: reusing data in transfer learning

I did this course to better understand how I can improve my own project, where I want to estimate the dimensions of an ear from a picture. My current approach is a landmark detection algorithm. Does anyone have any thoughts on transfer learning using the pretrained layers from GitHub - kbulutozler/ear-landmark-detection-with-CNN: A tool to detect 55 landmark points on a given ear image, then training my last layers with a sample of the same data relabelled to only the landmarks I'm interested in? Or do I need to use different data?
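To make the question concrete, this is roughly what I have in mind, assuming the pretrained network can be loaded as a saved Keras model (the file name, which layers I freeze, and the 10-landmark head are all placeholders, not the repo's actual API):

```python
# Minimal transfer-learning sketch: freeze the pretrained convolutional layers,
# replace the 55-landmark output with a smaller head for my own landmarks.
# "ear_landmark_model.h5" and the layer indices are placeholders.
import tensorflow as tf

base = tf.keras.models.load_model("ear_landmark_model.h5")

# Freeze everything except the last couple of layers.
for layer in base.layers[:-2]:
    layer.trainable = False

# New head: e.g. 10 landmarks -> 20 (x, y) coordinates.
features = base.layers[-3].output
new_head = tf.keras.layers.Dense(128, activation="relu")(features)
new_head = tf.keras.layers.Dense(20, activation="linear")(new_head)

model = tf.keras.Model(inputs=base.input, outputs=new_head)
model.compile(optimizer="adam", loss="mse")
# model.fit(relabelled_images, relabelled_landmarks, epochs=..., validation_split=0.2)
```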

This is off-topic, I know, but I would really appreciate any thoughts you have - thank you!

It would be helpful to understand more about the problem statement. What dimensions are you estimating, and what do you intend to do with that information? That is, are you looking merely at a bounding box (perimeter)? Detailed structure? Specific features as predictors of other conditions or syndromes, etc.?

Hi, thank you for your reply @ai_curious. My goal is to classify what sort of earphones would be suitable for a person. They take a set of photos of their ear with a phone, the algorithm detects certain features, and from those I can measure the dimensions of the ear and specify what earphone size would be suitable (the distance between the tragus and antihelix, for example, would indicate how wide the earphone can be).

For data I have the set I linked (605 labelled Google images) and my own set of images (53 images with only the labels I need, which better represent what the algorithm should estimate).

Can you also elaborate on this? Is classification done using a simple characteristic like small-medium-large or a brand or model? Or are you trying to achieve a more precise measurement between specific landmarks?

If the former, I wonder if you can let the model learn features instead of hand-engineering the landmarks. That is, you provide a training set that says 'all these ear examples have the label small, and all these examples have the label medium'. If you're talking about IEMs, which involve canal shape, it could extend to a finer gradation of shape and size, but there your limited data set will hurt generalizability.
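To illustrate what I mean by letting the model learn the features itself, a minimal classification sketch might look like this (the image size, architecture, and directory layout are just placeholders):

```python
# End-to-end classifier sketch: the network learns its own features from
# images labelled small / medium / large. Sizes and paths are made up.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # small / medium / large
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# train_ds = tf.keras.utils.image_dataset_from_directory("ears/", image_size=(128, 128))
# model.fit(train_ds, epochs=...)
```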

Ah, that is interesting. I want to specify 6 parameters: S/M/L of the ear tip and of the retention piece, and the placement and orientation of both. Would your idea still work for this, or would I need more data since there are so many classifications?

A CNN with adequate data could definitely learn to predict 6 values; it's basically like bounding box coordinates plus object classification. Hundreds of images is on the low side of what I would expect to work. Tens of images is really pushing it, especially when you think about reserving some for test. Are you going to train on 40 and test on 13? You would ideally want order(s) of magnitude more, but you can find out by experimenting. I would look for a fairly simple object detection architecture from the web and see how it does with your label values and types. You'll need to digitize the S/M/L values. Hope this helps.
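As a rough illustration of 'coordinates plus classification' in one network, something like this (Keras functional API; the output shapes are only guesses at how your labels might be encoded):

```python
# Multi-output sketch for the 6 parameters: two categorical heads (S/M/L)
# and two regression heads (placement and orientation). Shapes are guesses.
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(128, 128, 3))
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dense(128, activation="relu")(x)

tip_size = tf.keras.layers.Dense(3, activation="softmax", name="tip_size")(x)
ret_size = tf.keras.layers.Dense(3, activation="softmax", name="retention_size")(x)
placements = tf.keras.layers.Dense(4, activation="linear", name="placements")(x)      # (x, y) for tip and retention
orientations = tf.keras.layers.Dense(2, activation="linear", name="orientations")(x)  # one angle each

model = tf.keras.Model(inputs, [tip_size, ret_size, placements, orientations])
model.compile(optimizer="adam",
              loss={"tip_size": "categorical_crossentropy",
                    "retention_size": "categorical_crossentropy",
                    "placements": "mse",
                    "orientations": "mse"})
```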

PS: you could also end up doing transfer learning with one model architecture trained to do well on sizes, merged with one trained on placement and orientation. That might work better than training a single architecture on everything at the same time.
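A very rough sketch of that PS idea, assuming the two separately trained models can be loaded and each returns a single prediction/feature vector (file names and input size are placeholders):

```python
# Merge two frozen, separately trained models into one combined head.
# "size_model.h5" / "placement_model.h5" are placeholder file names.
import tensorflow as tf

size_model = tf.keras.models.load_model("size_model.h5")
place_model = tf.keras.models.load_model("placement_model.h5")
size_model.trainable = False
place_model.trainable = False

inputs = tf.keras.layers.Input(shape=(128, 128, 3))
merged = tf.keras.layers.Concatenate()([size_model(inputs), place_model(inputs)])
x = tf.keras.layers.Dense(64, activation="relu")(merged)

tip_size = tf.keras.layers.Dense(3, activation="softmax", name="tip_size")(x)
ret_size = tf.keras.layers.Dense(3, activation="softmax", name="retention_size")(x)
coords = tf.keras.layers.Dense(6, activation="linear", name="placement_orientation")(x)

combined = tf.keras.Model(inputs, [tip_size, ret_size, coords])
combined.compile(optimizer="adam",
                 loss={"tip_size": "categorical_crossentropy",
                       "retention_size": "categorical_crossentropy",
                       "placement_orientation": "mse"})
```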

That's a great idea, thank you. To directly assess these 6 parameters, which depend on some depth estimation, 3+ images of the same ear might be needed. Can I input these images together as one data example to the algorithm, perhaps by stacking them?

Also, since location is something I want to predict, can the labels be a mix of classifications and coordinates to predict?

The answer to this one is straightforward: yes. Object detection algorithms do exactly that, predicting location (where is it) and classification (what is it). You just need an approach to digitize the categorical variable so prediction error (loss) can be computed. Search on categorical encoding for alternatives.
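For example, one common way to digitize S/M/L is one-hot encoding (this little snippet is only illustrative):

```python
# One-hot encode S/M/L labels so a softmax output and categorical
# cross-entropy loss can be used. The label list is just an example.
import tensorflow as tf

labels = ["S", "M", "L", "M", "S"]
index = {"S": 0, "M": 1, "L": 2}
y = tf.keras.utils.to_categorical([index[l] for l in labels], num_classes=3)
print(y)  # e.g. "M" -> [0., 1., 0.]
```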

That I don't know. My first answer is 'no'; at least, that is not a standard approach in deep learning, which normally assumes inputs are independent. How does a human use the multiple images? Do some values come from one image but others from a different image? How does a human decide which to take from which image? That is, is there a learnable quality metric, for example are some images (perspectives) better for some measurements, and if so, how does a human learn that? Is there averaging or some other approach to combining information from multiple images into a single output? Is there a uniform number of images per ear, or does it vary? Sorry for the lack of domain knowledge.

Wow, you're asking some good questions here and being very helpful! I thought maybe the input could be 4D, but if it isn't a known method then a rookie like me might not be successful with that approach.

A professional, knowing the scale, would look at off angles to see past the tragus, for example, to estimate the approximate radius and get some idea about using an S/M/L tip. Or they would see that the dimensions of the ear are big overall when looking straight at it, so the retention piece needs to be large. The professionals who will label this data for me are used to working with digital imprints of someone's ear and have a sense of what dimensions will be comfortable without the earbud falling out. It's also worth noting that I can accept a relatively high error.

Yes - not all images would give information about all parameters.

If they can see a specific feature (like the ear canal) then they can make some assessment for it. So for my NN, perhaps it should first verify which features are in the image, then make a prediction for the parameters relevant to those features?

No, it would rather be assessing distances between visible features in each image.

This is what I’m aiming for.

I do think you could train a CNN to detect and predict the angle at which an image was captured (i.e. orthogonal to the subject vs. camera rear or forward of normal), which might help it also decide which measurements are more reliable from that particular perspective. I don't have a good idea off the top of my head how to 'bundle' sets of images. At training time maybe it doesn't matter… you just train the model to predict all values from all images. At prediction time you pass the images of a set one at a time through the NN and collect specific values from the image that is 'best' for those values. Understand this is just me kind of making it up on the fly; next I would want to do more literature search and see if there is prior art I could steal from. Hope this helped a little.
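If it helps, here's a made-up sketch of that runtime idea. predict_single() and the 'ideal angle' rule are purely hypothetical stand-ins for whatever your model and domain experts actually provide:

```python
# Pass the images of one ear through the model one at a time, then keep each
# parameter from the image whose estimated capture angle is closest to the
# angle assumed to be most informative for that parameter. Entirely hypothetical.

# Hypothetical rule: most reliable capture angle (degrees) per parameter.
IDEAL_ANGLE = {"tip_size": 30.0, "retention_size": 0.0,
               "placement": 0.0, "orientation": 60.0}

def predict_ear(model, images, predict_single):
    # predict_single(model, img) is a hypothetical wrapper returning a dict of
    # the parameters plus the model's estimated "view_angle" for that image.
    per_image = [predict_single(model, img) for img in images]
    combined = {}
    for name, target in IDEAL_ANGLE.items():
        best = min(per_image, key=lambda p: abs(p["view_angle"] - target))
        combined[name] = best[name]
    return combined
```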

OK, that's useful advice - your thoughts have been invaluable, thank you very much. I hope you have a great week.