My own sequential model Vs universal sentence encoding

Mauricio_Toro · February 13, 2024, 11:56am

Hi, everyone.

Suppose I have a lot of merchant names from bank transactions like Waitrose store, Sainsbury supermarket, Trainline, Apple subscription. I want to train a classifier to predict the category of the merchant, for example, restaurants, transportation, groceries, and I do have the labels.

Would it be better to train a sequential model on my own… or to use Google's Universal Sentence Encoder, which gives me an embedding of the whole sentence (the whole merchant name), and then train the classifier on that embedding?

My question arises because merchant names are a very specific subset of the English language: they contain a lot of proper nouns and are quite short, usually 1 to 4 words.

arvyzukai · February 13, 2024, 1:53pm

Hi @Mauricio_Toro

Of course, the answer is “it depends”.

First of all, it depends on your dataset size. If it’s small, you’re most probably better off with Google; if it’s big enough, the second question comes into play.

Second, it depends on whether the names carry enough signal (predictive power) and on how much the confusion matrix (precision/recall) matters. Most probably you would go with the Google representation plus your own classifier head anyway, but if your “names” are very specific to your domain (for example, “waitroses” starts to have meaning and predictive power) and the dataset is large enough, then there’s a chance that training your whole pipeline would lead to a better solution.

In other words, it’s hard to predict beforehand but I would guess that fine-tuning would have better results (unless you have a large dataset and the predictive power is not in the common words but in your dataset biases and “names” themselves).

Cheers

Mauricio_Toro · February 13, 2024, 3:34pm

Oh, thank you! Indeed, my dataset is large: around 20 million transactions with around 1 million different merchant names. Since it is a multi-class problem, I am considering a weighted-average F1, so both recall and precision are important, but classes with more samples count more.
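To make the metric concrete, here is a minimal sketch of a support-weighted F1 in plain Python. It computes the per-class F1 and averages it with weights proportional to how often each class appears in the true labels, which is my understanding of what "weighted average F1" usually means (it mirrors `average="weighted"` in scikit-learn's `f1_score`). The labels below are toy values invented for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's true support."""
    support = Counter(y_true)
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy labels, invented for illustration only.
y_true = ["groceries", "groceries", "groceries", "transportation"]
y_pred = ["groceries", "groceries", "transportation", "transportation"]
print(weighted_f1(y_true, y_pred))
```

With 50 classes, a large class that is classified well can hide poor F1 on small classes, so it may be worth printing the per-class values too.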

arvyzukai · February 13, 2024, 5:21pm

Then it’s certainly big enough. How many classes are there? Does each example have a single label, or can it have several at the same time? Have you tried a simple solution (like Naive Bayes) as a baseline?
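For reference, a baseline of that kind could look roughly like the sketch below: a multinomial Naive Bayes over character trigrams with add-one smoothing, written in plain Python. The trigram features, the `NaiveBayes` class, and the three training names are assumptions for illustration, not a recommendation of exact settings.

```python
import math
from collections import Counter, defaultdict


def trigrams(name):
    """Character trigrams of the lowercased name, spaces removed, with boundary markers."""
    text = "<" + "".join(name.lower().split()) + ">"
    return [text[i:i + 3] for i in range(len(text) - 2)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, names, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for name, label in zip(names, labels):
            grams = trigrams(name)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for gram in trigrams(name):
                score += math.log((counts[gram] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration only.
model = NaiveBayes().fit(
    ["Waitrose store", "Sainsbury supermarket", "Trainline"],
    ["groceries", "groceries", "transportation"],
)
print(model.predict("waitrosesup"))
print(model.predict("Trainline tickets"))
```

A baseline like this trains in one pass over the data, so it gives a quick number to compare any embedding-based or sequential model against.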

Mauricio_Toro · February 14, 2024, 9:10am

There are 50 classes, and each merchant belongs to a single class. I tried Google’s encoding + XGBoost and got around 70% weighted precision.

Hmmm… I forgot to mention something. Merchant names are often incomplete because of the way some banks handle them, so we have something like: Waitrose superm , Waitrose, Waitrose sup, Waitrose Supermarket, Waitrose store, Waitrose sto… Spaces can also be missing sometimes, so we can have something like: Waitreosesuper, waitrosesup, waitrosesupermarket…

In that case, would it be better to use something that is based on characters rather than words, like Facebook's fastText?

And, if I make my own sequential model, I should consider that tokens are characters and not words, right?
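To illustrate why character-based features might help with the truncated and space-less variants above, here is a small sketch in the spirit of fastText's subword n-grams. The helper names `char_ngrams` and `jaccard`, the n-gram range, and the choice to strip spaces before extracting n-grams are my own assumptions, not fastText's actual behaviour.

```python
def char_ngrams(name, n_min=3, n_max=5):
    """Set of character n-grams of the lowercased, space-free name with boundary markers."""
    text = "<" + "".join(name.lower().split()) + ">"
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def jaccard(a, b):
    """Overlap between two n-gram sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


reference = char_ngrams("Waitrose Supermarket")
for variant in ["Waitrose sup", "waitrosesup", "Waitrose sto", "Trainline"]:
    print(variant, round(jaccard(reference, char_ngrams(variant)), 3))
```

The truncated and space-less Waitrose variants still share many n-grams with the full name, while an unrelated merchant shares almost none, which is the property a word-level vocabulary loses.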

arvyzukai · February 14, 2024, 10:31am

In my experience, yes: character-level models should work best for this type of problem (short-sequence classification). When I used my own tokenization (a certain subset of UTF-8 characters) with a shallow embedding (embedding size 3-10), it produced better results than any pre-trained encoding model. In other words, I believe you would see better results straight away if you implement your own entire pipeline.
On the other hand, depending on your application and the actual difference in the names, the pre-processing and data cleaning could boost the performance a lot.
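As a rough sketch of what such a character-level tokenization with light pre-processing could look like: the alphabet, the `PAD`/`UNK` ids, the `max_len` of 32, and the normalization rules below are all assumptions chosen for illustration, not the exact setup described above.

```python
import string

# Assumed character subset; anything outside it maps to UNK.
ALPHABET = string.ascii_lowercase + string.digits + " &'-."
PAD, UNK = 0, 1
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def normalize(name):
    """Lowercase and collapse runs of whitespace (a minimal cleaning step)."""
    return " ".join(name.lower().split())


def encode(name, max_len=32):
    """Map a merchant name to a fixed-length list of integer character ids."""
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in normalize(name)[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


print(encode("Waitrose  SUP", max_len=16))
```

The resulting id sequences would then be fed to a small embedding layer followed by the sequence model; with a vocabulary this small, an embedding size in the single digits is plausible.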
