Feature Selection for unstructured data?

Can we apply feature selection to unstructured data? Does it make sense?

Images, for example, are typically represented by their raw pixels or, more commonly nowadays, by embeddings. Does it make sense to perform feature selection on embeddings?

Hi @jax79sg

I have worked a lot on Computer Vision.

Normally, with images you work on the entire image, but very often you need to do some preprocessing, depending on the source image and the result you need to achieve.
Just to give you some examples:

  • In healthcare, X-rays are normally provided as DICOM images at very high resolution (larger than 4000x4000). Even on a machine with 8 GPUs, or on TPUs, working with such large images is very slow. So you typically convert from DICOM to JPEG and then resize.
    To get an idea of the common approaches, you could have a look at Kaggle competitions.

  • Embeddings: embeddings are mostly used in NLP, since there you always need to translate words into numbers.
    You could get “embeddings” for images by using the output of an internal layer of a NN… but the network is generally trained on the downstream task and, in my opinion, once you have the embeddings you’ll lose information if you apply dimensionality reduction.
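
To make the resize step from the X-ray example concrete, here is a minimal sketch, not a definitive pipeline. It assumes NumPy and uses a synthetic 12-bit array in place of a real scan; the helper name `downscale_to_uint8` and the block-averaging approach are my own illustrative choices. With a real file the pixel array would typically come from pydicom, and writing the JPEG would need an imaging library such as Pillow, neither of which is shown here.

```python
import numpy as np

def downscale_to_uint8(pixels: np.ndarray, factor: int) -> np.ndarray:
    """Average non-overlapping factor x factor blocks, then rescale to 0..255."""
    h, w = pixels.shape
    # Drop edge rows/columns that do not fill a complete block.
    h_crop, w_crop = h - h % factor, w - w % factor
    blocks = pixels[:h_crop, :w_crop].astype(np.float64).reshape(
        h_crop // factor, factor, w_crop // factor, factor
    )
    small = blocks.mean(axis=(1, 3))
    # Min-max scale to the 8-bit range that JPEG expects.
    lo, hi = small.min(), small.max()
    scaled = (small - lo) / (hi - lo) if hi > lo else np.zeros_like(small)
    return (scaled * 255).round().astype(np.uint8)

# Synthetic stand-in for a high-resolution 12-bit X-ray (assumption: real data
# would be read with something like pydicom.dcmread(path).pixel_array).
rng = np.random.default_rng(0)
xray = rng.integers(0, 4096, size=(4096, 4096), dtype=np.uint16)

small = downscale_to_uint8(xray, factor=8)
print(small.shape, small.dtype)  # (512, 512) uint8
```

Block averaging is only one way to shrink an image; whether it preserves the details you care about depends on the task, so it is worth checking a few downscaled samples by eye.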

Anyway, there is no general criterion that is always valid. In the end, you could try and see… in some cases, if the number of images is very small and your original model overfits badly… you could get some small improvements by reducing the dimensionality.
But these subjects are best explored in other specializations, since they are very close to Computer Vision.
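
If you want to "try and see", a rough sketch of the experiment could look like the following. It assumes NumPy, uses synthetic vectors in place of real image embeddings, and applies PCA (dimensionality reduction, which is not strictly feature selection) through a hypothetical helper `pca_reduce`. It only reports how much variance is kept; whether the reduced embeddings actually help your model is something you would have to check on a validation set.

```python
import numpy as np

def pca_reduce(embeddings: np.ndarray, n_components: int):
    """Project embeddings onto their top principal components via SVD."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    variance = singular_values ** 2
    retained = variance[:n_components].sum() / variance.sum()
    return reduced, retained

# Synthetic embeddings (assumption): 200 "images", 512 dimensions, with most
# of the variance living in 20 latent directions plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 20))
mixing = rng.normal(size=(20, 512))
embeddings = latent @ mixing + 0.1 * rng.normal(size=(200, 512))

reduced, retained = pca_reduce(embeddings, n_components=20)
print(reduced.shape)  # (200, 20)
print(f"fraction of variance retained: {retained:.3f}")
```

Real embeddings rarely have such a clean low-rank structure, so the retained variance will usually be lower, which is exactly the information loss mentioned above.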

I hope this gives you some good ideas.
