I have an idea for an ML model I'd like to explore building with an open data set: a developer salary predictor.
The source data has several questions that are answered as multi-select, for example, programming language. As you can imagine, there are a lot of potential answers.
My first question is just one of terminology. Let's say there were only four options: Java, Python, JavaScript, and Rust.
In the course's description of one-hot encoding, it is characterised as a "pick-one" type of categorical data, as in: your ears are either pointy, floppy, or round.
With the programming language question, a developer will often have selected several of the above answers.
So in that instance, where the options are not mutually exclusive, is this still technically considered "one-hot" encoding?
The second part of my question has to do with complexity. There could easily be fifty distinct answers recorded across the entire data set for programming languages. With one-hot encoding, that means I have to engineer fifty features. Is there a simpler way of doing this? Or am I just running into what I've heard is often the bulk of the work in ML: data engineering? Especially once I expand this effort across what are probably twenty other fields that are also multi-select (like which platforms do you use, which developer tools do you use, etc.).

Are there any libraries that help with this kind of data engineering work? I'm sure I can come up with an algorithm that scans all responses in a given column and transforms them into a whole new set of one-hot features, one per distinct response value, repeated for every multi-select field (roughly like the sketch below). But that sounds daunting!
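For concreteness, here is the rough shape of the hand-rolled transformation I have in mind for a single column. This is just a sketch: the `LanguagesUsed` column name, the ";" separator, and the `Lang_` prefix are all made up for illustration, and I'm assuming pandas is available.

```python
import pandas as pd

# Hypothetical example: each respondent's multi-select answer arrives as one
# delimited string (column name and ";" separator are assumptions).
df = pd.DataFrame({
    "LanguagesUsed": ["Java;Python", "Python;Rust", "JavaScript", "Java;JavaScript;Python"],
})

# Step 1: scan every response in the column to collect the distinct options.
options = set()
for response in df["LanguagesUsed"].dropna():
    options.update(part.strip() for part in response.split(";"))

# Step 2: add one 0/1 indicator feature per distinct option.
for option in sorted(options):
    df[f"Lang_{option}"] = (
        df["LanguagesUsed"]
        .fillna("")
        .apply(lambda resp: int(option in {p.strip() for p in resp.split(";")}))
    )

print(df)
```

That is manageable for one column, but repeating it (and keeping the resulting feature names tidy) across twenty-odd multi-select fields is where it starts to feel daunting.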
Thanks in advance for answers!
On the plus side, as I was learning about this data set, I had already intuitively worked out that I probably needed to do some feature engineering for these multi-selects, in the form of one feature per response value. Then a few days later I got to the lesson on one-hot encoding of categorical features, which confirmed my intuition. Cool!