I don't know how to encode this feature

Hello, I am new to the world of AI and I am working on a small classification project using a boosted tree. The model classifies businesses to decide whether they are suitable to be entered into a database. The process also calls an API, so for each row I have an “inputName” from the input data and an “OutputName”, in addition to the website. I keep the tabular data in an Excel file and preprocess it with Pandas, but I have run into a problem. …
How do I encode business names?
The problem is dimensionality, so one-hot encoding would not work. I have read about hashing the inputName, but since there is also an outputName, I do not know whether the model could still learn from hashed values that the relationship between the two names strongly affects the result. That is, if the input and output names look similar, there is a higher probability that the label is true. I also have to deal with names written in different scripts, such as Cyrillic or Arabic. I have thought about using transliteration, but I am a bit lost.
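The "input and output look similar" idea can be turned into numeric columns directly, which avoids encoding the names themselves. The following is only a minimal sketch using the Python standard library; the function names and the choice of features are illustrative assumptions, not part of the original project. Note that the normalization here only strips accents and case. It does not transliterate Cyrillic or Arabic, so names in different scripts would first need a transliteration step (for example with a third-party package such as Unidecode).

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Casefold, strip accents, and collapse whitespace (no transliteration)."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.split())


def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams; strings of length <= n yield one n-gram."""
    if len(text) <= n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def name_similarity_features(input_name: str, output_name: str) -> dict:
    """Hypothetical helper: similarity scores between the two name columns."""
    a, b = normalize(input_name), normalize(output_name)
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    union = grams_a | grams_b
    return {
        # Ratio of matching characters found by difflib, between 0 and 1.
        "seq_ratio": SequenceMatcher(None, a, b).ratio(),
        # Overlap of character trigrams, between 0 and 1.
        "trigram_jaccard": len(grams_a & grams_b) / len(union) if union else 0.0,
        "len_diff": abs(len(a) - len(b)),
        "exact_match": int(a == b),
    }
```

With Pandas, these could be computed per row, for example by applying `name_similarity_features` to the two name columns and joining the resulting columns to the table before training the boosted tree. The column names would be whatever the Excel file uses.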

I have around 30 thousand rows of tabular data, and I am training a boosted tree, although I am also running small tests with a DNN.

What do you recommend?


Hello @Sir_Icebreaker

You haven’t mentioned on what basis you want to classify these business names.

Based on your explanation, I would like to know: is your Excel file a CSV file?

Can you explain what the dimensionality issue with one-hot encoding is?

The language or writing system could be handled by adding it as one of the features.
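As a rough illustration of adding the writing system as a feature, the sketch below labels each name with its dominant script using only the standard library. It relies on the fact that Unicode character names begin with the script name; the function name `dominant_script` is a made-up helper, and the heuristic is coarse (a few unusual letters may get odd labels), so treat it as a starting point rather than a proper language detector.

```python
import unicodedata
from collections import Counter


def dominant_script(name: str) -> str:
    """Return a coarse label for the most common script among the letters."""
    counts = Counter()
    for ch in name:
        if not ch.isalpha():
            continue  # ignore digits, spaces, and punctuation
        # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER A".
        char_name = unicodedata.name(ch, "")
        script = char_name.split(" ")[0] if char_name else "UNKNOWN"
        counts[script] += 1
    return counts.most_common(1)[0][0] if counts else "NONE"


print(dominant_script("Acme Ltd"))
print(dominant_script("Газпром"))
```

The resulting label is a low-cardinality categorical column, so one-hot encoding it (or passing it as a categorical feature to the boosted tree) should not run into the dimensionality problem that the raw names cause.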

You also need to give some information about the API you are using for your input and output business names.

Kindly elaborate on this part a little more so that your issue is easier to understand.

Regards
DP

You could store all of the business information in a database, and use the database keys to identify the business.

But you might also consider whether the name of the business is really important for classifying its behavior. Maybe you don’t need the business names at all during training.
