A large language model at character level

Hello everyone,

I’m currently working with bank transactions, and I keep running into merchant names like “Tesco Super”, “Tesco Superma”, “Paypal * Tescosupermar”, and “Zilch * tescosupermarket - 343”, which all refer to essentially the same merchant.

At the moment, I’m using a sentence embedding computed as the average of word embeddings (universal sentence embedding) to classify these transactions into categories like supermarkets, hotels, and restaurants. It works reasonably well, but truncated and concatenated names like the ones above don’t split cleanly into words, so I believe there’s room for improvement. That is why I’m interested in building a sentence embedding from the average of character embeddings instead.
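To make the idea concrete, here is a minimal sketch of what I mean by averaging character embeddings. The random vectors are only a placeholder for embeddings that would come from a trained model, and the names `EMBED_DIM`, `build_char_table`, and `embed` are made up for this illustration.

```python
import random

EMBED_DIM = 8  # hypothetical size, chosen only for illustration


def build_char_table(texts, dim=EMBED_DIM, seed=0):
    """Assign a random vector to every distinct character.

    The random values stand in for learned character embeddings.
    """
    rng = random.Random(seed)
    chars = sorted({ch for text in texts for ch in text.lower()})
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in chars}


def embed(text, table, dim=EMBED_DIM):
    """Mean-pool the vectors of all known characters in the text."""
    vectors = [table[ch] for ch in text.lower() if ch in table]
    if not vectors:
        return [0.0] * dim  # nothing recognisable: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


merchants = ["Tesco Super", "Paypal * Tescosupermar", "Zilch * tescosupermarket - 343"]
table = build_char_table(merchants)
vector = embed("Tesco Superma", table)
```

One known limitation of plain averaging is that it ignores character order, so I would expect a real model to add positional or contextual information on top of this.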

Are there any Large Language Models (LLMs) that have been trained on character tokens, from which I could extract an embedding that combines the meanings of the individual characters? I’m aware of the usual subword tokenizers, but I’m specifically looking for models that work with a character-level tokenizer.

I know there are models like CharBERT and CharacterBERT, but they still split the text into words first and only then build each word’s representation from its characters. For my purposes, the space " " is just another character and does not always mark a new word. Characters like “*” also carry significant meaning in merchant names.
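As a rough illustration of the tokenization I have in mind, the sketch below gives every character its own id, including " " and "*", with no word splitting at all. The helpers `build_vocab` and `encode` are hypothetical names for this example, not part of any library.

```python
def build_vocab(texts):
    """Map every distinct character, including " " and "*", to an integer id."""
    chars = sorted({ch for text in texts for ch in text})
    # Id 0 is reserved for characters not seen when the vocabulary was built.
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def encode(text, vocab, unk_id=0):
    """Turn a string into one id per character, without splitting on words."""
    return [vocab.get(ch, unk_id) for ch in text]


names = ["Paypal * Tescosupermar", "Zilch * tescosupermarket - 343"]
vocab = build_vocab(names)
ids = encode("Paypal * Tescosupermar", vocab)
```

Here `ids` has exactly one entry per character of the input, so the spaces and the “*” are kept as tokens in their own right.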

Thank you for your insights!