We need to solve a classification problem using machine learning techniques.
Our dataset looks like this:
| | Fornitore | Codice correlazione | Esenz. IVA | IVA indetraibile | conto_des |
|---|---|---|---|---|---|
| 1 | 6.01122e+07 | 5.RATA PROG.STRUMENTI DIGITALI X RETE ITALYTYRE | nan | No | 10.200.033 LICENZE DI PROGRAMMI SOFTWARE |
| 2 | 6.01122e+07 | SPESE INCASSO | nan | No | 70.901.002 |
| 3 | 6.01108e+07 | MICRO SWITCH | nan | No | 70.400.001 MATERIALI DI MANUTENZIONE |
| 4 | 6.01123e+07 | HUB ESOLVER | nan | No | 10.200.033 LICENZE DI PROGRAMMI SOFTWARE |
| 5 | 6.01117e+07 | 215/55R18 PRIMACY 4 S1 CORD. | nan | No | 70.200.051 MERCI C/ACQUISTI |
The features are:
- Fornitore
- Codice correlazione
- Esenz. IVA
- IVA indetraibile
The target variable is conto_des.
The goal is to predict conto_des from “Codice correlazione”. As you can see, the most important feature (Codice correlazione) is text, and it may be free text: for example, the user could type “Micro Switch Button” instead of “Micro Switch”, and the model should still answer 70.400.001 MATERIALI DI MANUTENZIONE.
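To make the free-text requirement concrete, here is a minimal sketch of what "Micro Switch Button" versus "Micro Switch" looks like under a word-level TF-IDF. It assumes scikit-learn's `TfidfVectorizer` with default settings and uses three descriptions from the sample table; the `known` list and the cosine-similarity lookup are only illustrative, not part of my actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three descriptions taken from the sample table (illustrative subset).
known = ["SPESE INCASSO", "MICRO SWITCH", "HUB ESOLVER"]

# Default settings: lowercasing, word tokens of 2+ characters, L2-normalised rows.
vectorizer = TfidfVectorizer().fit(known)

# A free-text variant; the unseen word "button" is simply ignored at transform time.
query = vectorizer.transform(["Micro Switch Button"])

# Compare the variant against every known description.
sims = cosine_similarity(query, vectorizer.transform(known))[0]
best = known[sims.argmax()]
print(best, sims)
```

In this toy setup the variant lands on the same vector as "MICRO SWITCH", which is the behaviour I would like the classifier to preserve.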
I’ve encoded all text as floating-point numbers. For “Codice correlazione” I used TfidfVectorizer, which produces an array of floats, and I added one column per float.
This way every column is numeric:
| | Fornitore | Esenz. IVA | IVA indetraibile | conto_des | cod_corr_0 | cod_corr_1 | cod_corr_2 | cod_corr_3 | cod_corr_4 | cod_corr_5 | cod_corr_6 | cod_corr_7 | cod_corr_8 | cod_corr_9 | cod_corr_10 | cod_corr_11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.01122e+07 | 7233902 | 0 | 10.200.033 LICENZE DI PROGRAMMI SOFTWARE | 0.41074 | 0.35631 | 0.419624 | 0.419624 | 0.419624 | 0.419624 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6.01122e+07 | 7233902 | 0 | 70.901.002 | 0.820131 | 0.572175 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 6.01108e+07 | 7233902 | 0 | 70.400.001 MATERIALI DI MANUTENZIONE | 0.674284 | 0.738472 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6.01123e+07 | 7233902 | 0 | 10.200.033 LICENZE DI PROGRAMMI SOFTWARE | 0.707107 | 0.707107 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 6.01117e+07 | 7233902 | 0 | 70.200.051 MERCI C/ACQUISTI | 0.447214 | 0.447214 | 0.447214 | 0.447214 | 0.447214 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
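For reference, the TF-IDF step described above can be sketched roughly as follows. This is a simplified reconstruction, not my exact code: it assumes pandas and scikit-learn's `TfidfVectorizer` with default settings, keeps only the text column, and creates one `cod_corr_i` column per vocabulary term. The column layout it produces may therefore differ from the table above, where the non-zero weights all sit in the leading columns.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Text column only, copied from the sample rows; other features are omitted here.
df = pd.DataFrame({
    "Codice correlazione": [
        "5.RATA PROG.STRUMENTI DIGITALI X RETE ITALYTYRE",
        "SPESE INCASSO",
        "MICRO SWITCH",
        "HUB ESOLVER",
        "215/55R18 PRIMACY 4 S1 CORD.",
    ],
})

# Fit on the whole column; the result is a sparse (n_rows, n_terms) matrix.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(df["Codice correlazione"])

# Assumption: one column per vocabulary term, so column i always means the same word.
cols = [f"cod_corr_{i}" for i in range(tfidf.shape[1])]
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=cols, index=df.index)

# Append the TF-IDF columns to the original frame.
encoded = df.join(tfidf_df)
print(encoded.shape)
```

With this layout each `cod_corr_i` is tied to a fixed term via `vectorizer.vocabulary_`, so the same position carries the same meaning in every row.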
The problem is that, even with this encoding, the features show no correlation with the target variable.
I was thinking of using XGBoost as the ML model, but maybe it’s not the right approach.
Can you suggest a better way (a neural network?) to perform this task?
Thanks in advance.