What counts as one-hot encoding

I have an idea for an ML model to explore building using an open data set. It would be a developer salary predictor.

The source data has several questions with multi-select answers, for example, programming language. As you can imagine, there are a lot of potential answers.

My first question is just a question of terminology. Let’s say there were just four: Java, Python, JavaScript, and Rust.

In the course description of one-hot encoding, it is characterised as more of a “pick-one” type of categorical data. As in your ears are either pointy, floppy, or round.

With the programming language data set, a developer will often have several of the above answers.

So in that instance, where one option is not mutually exclusive of the other, is this still technically considered “one-hot” encoding?

The second part of my question has to do with complexity. There could easily be fifty distinct answers recorded across the entire data set for programming languages. In a one-hot encoding situation, that means I have to engineer fifty features. Is there a simpler way of doing this? Or am I just running into what I’ve heard is often the bulk of the work in ML: data engineering?

This matters even more once I expand the effort across what are probably twenty other fields that are also multi-select (which platforms do you use, which developer tools do you use, etc.). Are there any libraries that help with this kind of data engineering work? I’m sure I can come up with an algorithm that scans all responses in a given column and turns each distinct response into its own true/false feature, then repeats that for every multi-select column. But that sounds daunting!

Thanks in advance for answers!

On the plus side, as I was learning about this data set, I had intuitively worked out that I probably needed to do some feature engineering for these multi-selects already in the form of one feature per response value. Then a few days later, I ended up getting to the one-hot encoding of categorical features lesson, which confirmed my intuition. Cool!

No. Each of the programming languages can be a separate true/false feature.
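
To make that concrete, here is a minimal sketch of the one-feature-per-answer idea (sometimes called multi-hot or multi-label encoding). It assumes each response is stored as a single string with answers separated by `;`, which may not match your data set; `multi_hot` and the sample responses are made up for illustration. If you would rather not write this yourself, scikit-learn's `MultiLabelBinarizer` and pandas' `Series.str.get_dummies` do essentially the same job.

```python
def multi_hot(responses, sep=";"):
    """Turn multi-select strings into one 0/1 column per distinct answer."""
    # Split each response into a set of answers.
    # A missing response (None or "") becomes an empty set.
    parsed = [
        {item.strip() for item in (r or "").split(sep) if item.strip()}
        for r in responses
    ]
    # The column names are every distinct answer seen, sorted for a stable order.
    vocabulary = sorted(set().union(*parsed))
    # Each row gets a 1 where the respondent selected that answer, else 0.
    rows = [
        [1 if label in answers else 0 for label in vocabulary]
        for answers in parsed
    ]
    return vocabulary, rows


# Hypothetical sample of a "programming language" column.
responses = ["Java;Python", "Rust", "Python;JavaScript;Rust", None]
columns, rows = multi_hot(responses)

print(columns)
for row in rows:
    print(row)
```

Unlike true one-hot encoding, a row here can contain several 1s (or none at all). Running the same function over each multi-select column gives one block of 0/1 features per question.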


Hello @pchittum,

They are not one-hot.

You may just keep all of them.

You may consider grouping the minor languages together.

You may consider regrouping all of the languages by some criteria.

You may consider using target encoding.

You may convert the languages (of 50 different kinds) into a set of, say, 10 different features which describe the languages. Such features could be “popularity among employees”, “popularity among employers”, …

You may use PCA to reduce the languages (of 50 different kinds) to a smaller set of features.

There could be more ways…

Each approach has its ups and downs. While one-hot is a one-click method, the others take effort to find the configuration that best trades off those ups and downs. So, yes, engineering.
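
As one illustration of the grouping option, here is a rough sketch that folds rarely selected languages into a single catch-all label before encoding. The threshold `min_count`, the label `"Other"`, the helper name `group_rare`, and the sample data are all assumptions for the example; a sensible cutoff depends on your data.

```python
from collections import Counter


def group_rare(parsed, min_count=2, other="Other"):
    """Replace answers selected fewer than min_count times with a catch-all label."""
    # Count how many respondents selected each answer.
    counts = Counter(label for answers in parsed for label in answers)
    keep = {label for label, n in counts.items() if n >= min_count}
    # Rewrite each respondent's set, mapping rare answers to the catch-all label.
    return [
        {label if label in keep else other for label in answers}
        for answers in parsed
    ]


# Hypothetical respondents, each with the set of languages they selected.
parsed = [{"Java", "Python"}, {"Rust", "Python"}, {"COBOL"}, {"Java", "Fortran"}]
grouped = group_rare(parsed, min_count=2)
print(grouped)
```

The grouped sets can then be turned into true/false features as usual, with far fewer columns. The trade-off is that any salary signal specific to a rare language is lost.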
