Hello, I am new to the world of AI and I am working on a small classification project using a boosted tree. The goal is to classify businesses to decide whether they are suitable to be entered into a database. It also makes a call to an API, so my data includes an “inputName” and an “OutputName”, in addition to the website. I have the tabular data in an Excel file and I preprocess it with Pandas, but I have run into a problem:
How do I encode business names?
The problem is dimensionality, so One Hot Encoding would not work. I have read about hashing the inputName, but since I also have an OutputName, I do not know whether the model could still learn from the hashed values that the two names are strongly related to the result. That is, if the input and output names look similar, there is a higher probability that the label is true. I also have to deal with names written in different scripts, such as Cyrillic or Arabic. I have thought about using transliteration, but I am a bit lost.
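To make the similarity idea concrete, this is roughly the kind of feature I have in mind, as a minimal sketch using only the Python standard library. The function names `normalize_name` and `name_similarity` are made up for illustration. It does not transliterate anything: it only strips accents and other combining marks, so a Cyrillic name compared against a Latin one would still score low.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Casefold, strip combining marks, and collapse whitespace.

    Note: this is NOT transliteration; the script of the text is unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.casefold().split())


def name_similarity(input_name: str, output_name: str) -> float:
    """Return a similarity ratio in [0, 1] between the two normalized names."""
    a = normalize_name(input_name)
    b = normalize_name(output_name)
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical example values, not taken from my real data.
score = name_similarity("Café Müller GmbH", "cafe muller gmbh")
```

The idea would be to compute this score for every row and pass it to the boosted tree as one numeric column, instead of (or alongside) an encoding of the raw names.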
I have around 30 thousand rows of tabular data, and I am training a boosted tree, although I am also running small tests with a DNN.
What do you recommend?