A large language model at character level

Mauricio_Toro · April 29, 2024, 10:27am

Hello everyone,

I’m currently working with bank transactions, and I’ve encountered merchant names like “Tesco Super”, “Tesco Superma”, “Paypal * Tescosupermar”, and “Zilch * tescosupermarket - 343”, which have very similar meanings.

Currently, I’m using a sentence embedding based on the average of word embeddings (universal sentence embedding) to classify these transactions into categories like supermarkets, hotels, and restaurants. While it works reasonably well, I believe there’s room for improvement, because truncated or concatenated names such as “Tescosupermar” do not match any known word. Hence, I’m interested in creating a sentence embedding based on the average of character embeddings.
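To make the idea concrete, here is a minimal sketch of averaging character embeddings in plain Python. The embedding table is random and untrained, and the names `build_vocab`, `char_embedding_table`, and `embed_name` are hypothetical, made up for illustration; in practice the vectors would have to come from a trained character-level model.

```python
import random

def build_vocab(names):
    """Map every distinct character (including ' ' and '*') to an integer id."""
    chars = sorted({ch for name in names for ch in name.lower()})
    return {ch: i for i, ch in enumerate(chars)}

def char_embedding_table(vocab_size, dim, seed=0):
    """Random, untrained placeholder vectors, one row per character id."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed_name(name, vocab, table):
    """Average the character vectors of a name; unknown characters are skipped."""
    ids = [vocab[ch] for ch in name.lower() if ch in vocab]
    dim = len(table[0])
    if not ids:
        return [0.0] * dim
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]

names = [
    "Tesco Super",
    "Tesco Superma",
    "Paypal * Tescosupermar",
    "Zilch * tescosupermarket - 343",
]
vocab = build_vocab(names)
table = char_embedding_table(len(vocab), dim=8)
vectors = [embed_name(n, vocab, table) for n in names]
```

One limitation of this sketch is that a plain average ignores character order, so any two strings with the same characters in the same counts get the same vector. That is part of why I would prefer embeddings from a model that was actually trained on character sequences.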

Are there any Large Language Models (LLMs) that have been trained on character-level tokens and from which I can extract an embedding that combines the meanings of the characters? I’m aware of word and subword tokenizers, but I’m specifically looking for models compatible with a character-level tokenizer.

I know there are models like CharBERT and CharacterBERT, but they still use word tokenizers: the text is split into words first, and each word’s meaning is then built from its characters. However, for my purposes, the space " " is just another character and does not always denote a new word. Additionally, characters like “*” carry significant meaning in merchant names.
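A small sketch of the difference I mean, using only built-in Python string operations (the helper names `word_tokens` and `char_tokens` are hypothetical, chosen just for this illustration):

```python
def word_tokens(text):
    """Whitespace word split: spaces disappear and only delimit tokens."""
    return text.split()

def char_tokens(text):
    """Character split: every symbol, including ' ' and '*', is its own token."""
    return list(text)

name = "Paypal * Tescosupermar"
print(word_tokens(name))      # ['Paypal', '*', 'Tescosupermar']
print(char_tokens(name)[:9])  # ['P', 'a', 'y', 'p', 'a', 'l', ' ', '*', ' ']
```

With the word split, “Tescosupermar” becomes a single token that probably does not match any known word, whereas the character split keeps the spaces and the “*” as ordinary symbols.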

Thank you for your insights!
