A Character Based Language Model

Mauricio_Toro · May 1, 2024, 2:04pm

Hello everyone,

I’m currently working with bank transactions, and I’ve encountered merchant names like “Tesco Super”, “Tesco Superma”, “Paypal * Tescosupermar”, and “Zilch * tescosupermarket - 343”, which have very similar meanings.

Currently, I’m using a sentence embedding computed as the average of word embeddings (universal sentence embedding) to classify these transactions into categories such as supermarkets, hotels, and restaurants. While it works reasonably well, I believe there’s room for improvement, so I’m interested in creating a sentence embedding based on the average of character embeddings.
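To make the idea concrete, here is a minimal sketch of mean-pooling character embeddings, using only the Python standard library. The names `build_char_table`, `embed_merchant`, and `EMBED_DIM` are hypothetical, and the vectors are random placeholders rather than trained embeddings, so this shows only the pooling step, not a useful representation.

```python
import random

EMBED_DIM = 8  # hypothetical embedding size, chosen only for illustration


def build_char_table(texts, dim=EMBED_DIM, seed=0):
    """Assign a random placeholder vector to every character seen in texts.

    In a real system these vectors would come from a trained model.
    """
    rng = random.Random(seed)
    chars = sorted({ch for text in texts for ch in text})
    return {ch: [rng.gauss(0.0, 1.0) for _ in range(dim)] for ch in chars}


def embed_merchant(text, table, dim=EMBED_DIM):
    """Mean-pool the vectors of all known characters in text."""
    vectors = [table[ch] for ch in text if ch in table]
    if not vectors:
        # No known characters: fall back to a zero vector.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


merchants = ["Tesco Super", "Tesco Superma", "Paypal * Tescosupermar"]
char_table = build_char_table(merchants)
merchant_vec = embed_merchant("Tesco Super", char_table)
```

Note that a plain average discards character order, so two strings made of the same characters in a different order would receive the same vector.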

Are there any large language models (LLMs) that have been trained on character tokens and from which I can extract an embedding combining the meanings of characters? I’m aware of the usual word and subword tokenizers, but I’m specifically looking for models compatible with a character-level tokenizer.

I know there are models like CharBERT and CharacterBERT, but they still tokenize by word and only build each word’s representation from its characters. For my purposes, however, the space " " is just another character and does not always denote a new word. Additionally, characters like “*” carry significant meaning in merchant names.
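As a rough illustration of that requirement, a character-level tokenizer that treats the space and “*” as ordinary tokens could look like the sketch below. The function names and the `<pad>`/`<unk>` special tokens are assumptions made for the example, not part of any particular library.

```python
def build_char_vocab(texts):
    """Map every character (including ' ' and '*') to an integer id."""
    vocab = {"<pad>": 0, "<unk>": 1}  # hypothetical special tokens
    for ch in sorted({ch for text in texts for ch in text}):
        vocab[ch] = len(vocab)
    return vocab


def encode_chars(text, vocab):
    """Turn a string into one id per character; unseen characters map to <unk>."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]


def decode_chars(ids, vocab):
    """Invert encode_chars for ids that correspond to single characters."""
    inverse = {idx: ch for ch, idx in vocab.items()}
    return "".join(inverse[idx] for idx in ids)


names = ["Paypal * Tescosupermar", "Zilch * tescosupermarket - 343"]
char_vocab = build_char_vocab(names)
char_ids = encode_chars("Paypal * Tescosupermar", char_vocab)
```

Because no whitespace splitting happens, the sequence length equals the string length, and the ids can be fed to any model that accepts a custom vocabulary.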

Thank you for your insights!

Deepti_Prasad · May 1, 2024, 4:07pm

Hi @Mauricio_Toro

I found the link below, which may fit your requirement:

Chars2vec: character-based language model for handling real world texts with spelling errors and human slang | by Intuition Engineering | HackerNoon.com | Medium.

I would also suggest exploring the Hugging Face website, where you will find many text-based models and their variants.

Regards
DP
