ML project model suggestion

We need to solve a classification problem with machine learning techniques.
Our dataset looks like this:

  | Fornitore   | Codice correlazione                             | Esenz. IVA | IVA indetraibile | conto_des
1 | 6.01122e+07 | 5.RATA PROG.STRUMENTI DIGITALI X RETE ITALYTYRE | nan        | No               | 10.200.033 LICENZE DI PROGRAMMI SOFTWARE
2 | 6.01122e+07 | SPESE INCASSO                                   | nan        | No               | 70.901.002
3 | 6.01108e+07 | MICRO SWITCH                                    | nan        | No               | 70.400.001 MATERIALI DI MANUTENZIONE
4 | 6.01123e+07 | HUB ESOLVER                                     | nan        | No               | 10.200.033 LICENZE DI PROGRAMMI SOFTWARE
5 | 6.01117e+07 | 215/55R18 PRIMACY 4 S1 CORD.                    | nan        | No               | 70.200.051 MERCI C/ACQUISTI

The features are:

  • Fornitore
  • Codice correlazione
  • Esenz. IVA
  • IVA indetraibile

The target variable is conto_des

The goal is to predict conto_des from “Codice correlazione”. As you can see, the most important feature (Codice correlazione) is text, and it may be free text: for example, the user could type Micro Switch Button instead of Micro Switch, and the model should still answer 70.400.001 MATERIALI DI MANUTENZIONE.
I’ve encoded all text fields as floats. For “Codice correlazione” I used TfidfVectorizer, which produces an array of floats, and I added one column per float.
This way every column is numeric:

  | Fornitore   | Esenz. IVA | IVA indetraibile | conto_des                                | cod_corr_0 | cod_corr_1 | cod_corr_2 | cod_corr_3 | cod_corr_4 | cod_corr_5 | cod_corr_6 | cod_corr_7 | cod_corr_8 | cod_corr_9 | cod_corr_10 | cod_corr_11
0 | 6.01122e+07 | 7233902    | 0                | 10.200.033 LICENZE DI PROGRAMMI SOFTWARE | 0.41074    | 0.35631    | 0.419624   | 0.419624   | 0.419624   | 0.419624   | 0          | 0          | 0          | 0          | 0           | 0
1 | 6.01122e+07 | 7233902    | 0                | 70.901.002                               | 0.820131   | 0.572175   | 0          | 0          | 0          | 0          | 0          | 0          | 0          | 0          | 0           | 0
2 | 6.01108e+07 | 7233902    | 0                | 70.400.001 MATERIALI DI MANUTENZIONE     | 0.674284   | 0.738472   | 0          | 0          | 0          | 0          | 0          | 0          | 0          | 0          | 0           | 0
3 | 6.01123e+07 | 7233902    | 0                | 10.200.033 LICENZE DI PROGRAMMI SOFTWARE | 0.707107   | 0.707107   | 0          | 0          | 0          | 0          | 0          | 0          | 0          | 0          | 0           | 0
4 | 6.01117e+07 | 7233902    | 0                | 70.200.051 MERCI C/ACQUISTI              | 0.447214   | 0.447214   | 0.447214   | 0.447214   | 0.447214   | 0          | 0          | 0          | 0          | 0          | 0           | 0

The problem is that, even with this encoding, there is no correlation between the features and the target variable.
I was thinking of using XGBoost as the ML model, but maybe it’s not the right approach.

Can you suggest a better way (a neural network?) to perform this task?
Thanks in advance


Hi. I was working on a text classification problem, and we used K-means with text embeddings. My texts were longer, so they carried more semantic weight. I don’t know if this strategy will work with short texts, but you can give it a try.
Hmm… but we also didn’t have a specific number of groups, and you do.
Try searching for a clustering method where you can define the clusters yourself (one per class).
Another strategy could be XGBoost with embeddings.
If you don’t have a lot of data (I’m talking about hundreds of thousands), I would not try neural networks, because they need a lot of data to “learn”.
This is not a specific answer, but I hope it helps a little.
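
One way to read the "one cluster per class" idea is a nearest-centroid classifier: average the vectors of each class and assign a new text to the closest centroid. The sketch below is only an illustration under assumptions: it uses TF-IDF character n-grams as a stand-in for the embeddings, the five sample rows from the question as training data, and the helper name predict_conto is made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Training rows copied from the sample table in the question.
texts = [
    "5.RATA PROG.STRUMENTI DIGITALI X RETE ITALYTYRE",
    "SPESE INCASSO",
    "MICRO SWITCH",
    "HUB ESOLVER",
    "215/55R18 PRIMACY 4 S1 CORD.",
]
labels = np.array([
    "10.200.033 LICENZE DI PROGRAMMI SOFTWARE",
    "70.901.002",
    "70.400.001 MATERIALI DI MANUTENZIONE",
    "10.200.033 LICENZE DI PROGRAMMI SOFTWARE",
    "70.200.051 MERCI C/ACQUISTI",
])

# Assumption: TF-IDF character n-grams stand in for real text embeddings.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(texts).toarray()

# One centroid per class: the mean of that class's vectors, L2-normalised.
classes = sorted(set(labels))
centroids = np.vstack([X[labels == c].mean(axis=0) for c in classes])
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)


def predict_conto(text):
    """Return the class whose centroid has the highest cosine similarity."""
    q = vectorizer.transform([text]).toarray()[0]
    norm = np.linalg.norm(q)
    if norm == 0:
        return None  # no known n-gram in the text, so nothing to compare
    scores = centroids @ (q / norm)
    return classes[int(np.argmax(scores))]


print(predict_conto("Micro Switch Button"))
```

Unlike K-means, the number of groups is fixed by the labels, so nothing has to be tuned there. With real embeddings, only the two lines that build X and q would change.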


Thanks @haroldc
I’m trying a neural network approach using a pre-trained model (I’m trying fastText). I compute embeddings of my “Codice correlazione” column and then, using cosine similarity, retrieve the most similar results. All other features are used only to filter the data on which the similarity is computed. It seems to work for my use case, so there is no need to train a model and no need to encode the features.
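
A rough sketch of this filter-then-retrieve idea is below. It is not the actual code: to keep it self-contained, embed() is a placeholder (hashed character trigrams) standing in for the pre-trained fastText sentence vectors, and the names records and most_similar are made up for the example. The Fornitore values are the rounded ones shown in the table.

```python
import zlib

import numpy as np


def embed(text, dim=1024):
    """Placeholder embedding (assumption): hashed character trigram counts,
    L2-normalised. A real setup would return the pre-trained model's vector."""
    vec = np.zeros(dim)
    padded = f" {text.lower()} "
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


# Rows from the sample table; embeddings are computed once, up front.
records = [
    {"fornitore": 6.01122e+07, "text": "5.RATA PROG.STRUMENTI DIGITALI X RETE ITALYTYRE",
     "conto_des": "10.200.033 LICENZE DI PROGRAMMI SOFTWARE"},
    {"fornitore": 6.01122e+07, "text": "SPESE INCASSO", "conto_des": "70.901.002"},
    {"fornitore": 6.01108e+07, "text": "MICRO SWITCH",
     "conto_des": "70.400.001 MATERIALI DI MANUTENZIONE"},
    {"fornitore": 6.01123e+07, "text": "HUB ESOLVER",
     "conto_des": "10.200.033 LICENZE DI PROGRAMMI SOFTWARE"},
    {"fornitore": 6.01117e+07, "text": "215/55R18 PRIMACY 4 S1 CORD.",
     "conto_des": "70.200.051 MERCI C/ACQUISTI"},
]
for record in records:
    record["vector"] = embed(record["text"])


def most_similar(query, fornitore=None, top_k=1):
    """Filter on the other features first, then rank by cosine similarity."""
    candidates = [r for r in records
                  if fornitore is None or r["fornitore"] == fornitore]
    q = embed(query)
    # Vectors are unit length, so the dot product is the cosine similarity.
    ranked = sorted(candidates, key=lambda r: float(q @ r["vector"]), reverse=True)
    return ranked[:top_k]


best = most_similar("Micro Switch Button")
print(best[0]["conto_des"] if best else None)
```

Since nothing is trained, adding a new row only means computing one more embedding.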
