Suppose I have a lot of merchant names from bank transactions like Waitrose store, Sainsbury supermarket, Trainline, Apple subscription. I want to train a classifier to predict the category of the merchant, for example, restaurants, transportation, groceries, and I do have the labels.
Would it be better to train a sequential model of my own… or to use Google's Universal Sentence Encoder, which gives me an embedding of the whole sentence (here, the whole merchant name), and then train the classifier on that embedding?
My question arises because merchant names are a very specific subset of the English language: they contain a lot of proper nouns and they are quite short, usually 1 to 4 words.
Of course, the answer is “it depends”.
First of all, it depends on your dataset size. If it’s small, you’re most probably better off with Google; if it’s big enough, then the second question comes in.
Second, it depends on whether the names contain enough signal or predictive power, and on how much the confusion matrix (precision/recall) matters to you. Most probably you would go with the Google representation plus your own classifier head here anyway, but if your “names” are very specific to your domain (for example, “waitroses” starts to carry meaning and predictive power) and the dataset is large enough, then there’s a chance that training your whole pipeline would lead to a better solution.
In other words, it’s hard to predict beforehand, but I would guess that fine-tuning would give better results (unless you have a large dataset and the predictive power lies not in the common words but in your dataset’s biases and the “names” themselves).
Oh, thank you! Indeed, my dataset is large: around 20 million transactions with around 1 million different merchant names. Since it is a multi-class problem, I am considering a weighted-average F1, so both recall and precision are important, but classes with more samples matter more.
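To be concrete about the metric, here is a minimal sketch of what I mean by weighted-average F1: the per-class F1 scores averaged with each class weighted by its number of true samples. The function name and the toy labels are made up for illustration, and it assumes a non-empty list of labels.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by each class's true sample count."""
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct hits

    total = 0.0
    for label, n in support.items():
        precision = tp[label] / predicted[label] if predicted[label] else 0.0
        recall = tp[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n * f1            # bigger classes contribute more
    return total / len(y_true)

# Toy example (hypothetical labels): groceries F1 = 0.8, transportation F1 = 2/3
y_true = ["groceries", "groceries", "groceries", "transportation"]
y_pred = ["groceries", "groceries", "transportation", "transportation"]
score = weighted_f1(y_true, y_pred)   # (3 * 0.8 + 1 * 2/3) / 4
```

Classes that never occur in the true labels get zero weight here, so a rare class can be predicted badly without moving the score much.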
Then it’s certainly big enough. How many classes are there? Are the labels single class or multiple at the same time? Have you tried a simple solution (like Naive Bayes) as a baseline?
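A baseline of that kind can be very small. Below is a rough sketch of a multinomial Naive Bayes over character trigrams in plain Python. The `trigrams` helper, the `NaiveBayes` class, the `<`/`>` boundary markers and the four toy training names are all my assumptions for illustration, not something taken from your data.

```python
import math
from collections import Counter, defaultdict

def trigrams(name):
    """Character trigrams of the lowercased name with spaces removed."""
    s = "<" + name.lower().replace(" ", "") + ">"
    return [s[i:i + 3] for i in range(len(s) - 2)]

class NaiveBayes:
    def fit(self, names, labels, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)  # label -> trigram counts
        for name, label in zip(names, labels):
            self.feature_counts[label].update(trigrams(name))
        self.vocab = {g for c in self.feature_counts.values() for g in c}
        return self

    def predict(self, name):
        n = sum(self.class_counts.values())
        v = len(self.vocab)
        # Trigrams never seen in training carry no information; skip them.
        grams = [g for g in trigrams(name) if g in self.vocab]
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            score = math.log(count / n)             # log prior
            for g in grams:                         # log likelihoods
                score += math.log((counts[g] + self.alpha) / (total + self.alpha * v))
            if score > best_score:
                best, best_score = label, score
        return best

# Hypothetical toy data, only to show the call pattern.
model = NaiveBayes().fit(
    ["Waitrose Supermarket", "Sainsbury supermarket", "Trainline", "National Rail"],
    ["groceries", "groceries", "transportation", "transportation"],
)
guess_a = model.predict("waitrosesup")
guess_b = model.predict("Trainlin")
```

Whatever number this gives on a held-out split is the one any heavier model has to beat.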
There are 50 classes, and each merchant has a single class. I tried Google’s encoding + XGBoost and got a weighted precision of around 70%.
Hmmm… I forgot to mention something. Many times the merchant names are incomplete because of the way some banks handle them, so we have something like: Waitrose superm, Waitrose, Waitrose sup, Waitrose Supermarket, Waitrose store, Waitrose sto… Spaces can also be missing sometimes, so we can have something like: Waitreosesuper, waitrosesup, waitrosesupermarket…
In that case, would it be better to use something that is based on characters and not words, like Facebook’s fastText?
And, if I make my own sequential model, I should consider that tokens are characters and not words, right?
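To illustrate why I am thinking of characters, here is a small sketch of subword n-grams in the spirit of fastText: wrap the string in boundary markers and collect all character n-grams. Treating the whole name as one token with spaces removed, the n-gram range of 3 to 5, and the helper names `char_ngrams` and `jaccard` are my own assumptions, not how the fastText library itself is configured.

```python
def char_ngrams(text, n_min=3, n_max=5):
    """Set of character n-grams of the boundary-marked, space-free name."""
    marked = "<" + text.lower().replace(" ", "") + ">"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }

def jaccard(a, b):
    """Overlap between two n-gram sets, from 0.0 (disjoint) to 1.0 (equal)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

truncated = char_ngrams("Waitrose sup")
full = char_ngrams("waitrosesupermarket")
other = char_ngrams("Trainline")

same_merchant = jaccard(truncated, full)    # many shared n-grams
different = jaccard(truncated, other)       # no shared n-grams
```

With this representation a missing space changes nothing and a truncated name still shares most of its n-grams with the full one, which a word-level tokenizer would not give me.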
In my experience, yes, character-level models should work best for this type of problem (short sequence classification). When I used my own tokenization (a certain subset of UTF-8 characters) with a shallow embedding (embedding size 3-10), it produced better results than any pre-trained encoding model. In other words, I believe you would see better results straight away once you implement your own entire pipeline.
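As a rough sketch of that kind of tokenization: fix a character subset, map each character to an integer id, and pad to a fixed length. The alphabet, the reserved `PAD`/`UNK` ids, the maximum length of 32 and the embedding size of 4 are all assumptions chosen for illustration; the random embedding table only stands in for weights that a real model would learn together with the classifier.

```python
import random
import string

# Assumed character subset: lowercase letters, digits, space, a few symbols.
ALPHABET = string.ascii_lowercase + string.digits + " &'-.*/"
PAD, UNK = 0, 1                                   # reserved ids
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}
VOCAB_SIZE = len(CHAR_TO_ID) + 2

def encode(name, max_len=32):
    """Map a merchant name to a fixed-length list of character ids."""
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in name.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))     # right-pad short names

# Shallow embedding table, randomly initialised as a placeholder.
EMBED_DIM = 4
_rng = random.Random(0)
EMBEDDINGS = [
    [_rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(name, max_len=32):
    """Look up one small vector per character position."""
    return [EMBEDDINGS[i] for i in encode(name, max_len)]

ids = encode("Ab", max_len=4)        # [2, 3, 0, 0]
vectors = embed("waitrosesup", max_len=16)
```

The resulting sequence of small vectors is what you would feed into the sequential model (a small CNN or RNN, for instance) in place of word embeddings.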
On the other hand, depending on your application and the actual difference in the names, the pre-processing and data cleaning could boost the performance a lot.
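A minimal sketch of such cleaning, assuming you can build a list of known merchant keys from your labelled data: strip accents, case, spaces and punctuation, then map each name to the longest known key it starts with. The `normalize` and `canonical_prefix` helpers and the `known` set are hypothetical names for illustration.

```python
import re
import unicodedata

def normalize(name):
    """Strip accents, lowercase, and keep only ASCII letters and digits."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())

def canonical_prefix(name, known):
    """Return the longest known key the normalised name starts with,
    or the normalised name itself when nothing matches."""
    key = normalize(name)
    matches = [k for k in known if key.startswith(k)]
    return max(matches, key=len) if matches else key

known = {"waitrose", "trainline"}    # hypothetical merchant keys

cleaned = [canonical_prefix(n, known)
           for n in ["Waitrose superm", "waitrosesup", "Waitrose sto", "Waitreosesuper"]]
# The first three collapse to "waitrose"; the misspelled one does not match.
```

Exact prefix matching will not catch misspellings such as "Waitreosesuper", so something fuzzier (edit distance, or the character-level model itself) is still needed for those.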