Recognition of sensitive data

I'm facing a small crisis of ideas. The first question is which tools to choose, and the second is how to prompt the LLM if I stick with the approach I've already chosen.

My current task is to recognize different kinds of sensitive data that can appear in a messenger chat. I should also say that I don't have a comprehensive dataset with NER tags covering this sensitive data.

So my decision was to use the wonderful Python library Faker. I really think DeepLearning should include at least basic information about this tool in one of its courses.

This is the code I use to generate the data; you can also see which types of sensitive data there are:

from faker.exceptions import UniquenessException
from faker import Faker, config

locales = config.AVAILABLE_LOCALES

# Output paths for the generated corpus (placeholder values).
GENERATED_SENTENCES_PATH = 'generated_sentences.txt'
TAGS_PATH = 'tags.txt'

contact_type = {
    'unrecognized': 'O',
    'name': 'N',
    'corporate_info': 'C',
    'phone': 'P',
    'email': 'E',
    'url': 'U',
    'address': 'A',
    'bank': 'B',
    'bank_iban': 'B_I',
    'bank_swift': 'B_S',
    'card_number': 'C_N',
    'card_expire': 'C_E',
    'card_cvv': 'C_C',
}


def generate_training_data():
    def add_words(word_gen, tag):
        # getattr() below falls back to '' when a locale has no such provider
        if word_gen == '':
            return

        try:
            words_list = word_gen().split()
            sentences_file.write(' '.join(words_list) + '\n')
            # repeat the tag once per token so the two files stay aligned line by line
            tags = f'{tag.upper()} ' * len(words_list)
            tags_file.write(tags.rstrip() + '\n')
        except UniquenessException:
            # fake.unique ran out of unique values for this provider
            pass

    with open(GENERATED_SENTENCES_PATH, 'w', encoding='utf-8') as sentences_file, \
            open(TAGS_PATH, 'w', encoding='utf-8') as tags_file:
        for locale in locales:
            fake = Faker(locale)
            for _ in range(999):
                for word_gen, tag in [
                    (getattr(fake.unique, 'phone_number', ''), contact_type['phone']),
                    (getattr(fake.unique, 'email', ''), contact_type['email']),
                    (getattr(fake.unique, 'uri', ''), contact_type['url']),
                    (getattr(fake.unique, 'domain_name', ''), contact_type['url']),
                    (getattr(fake.unique, 'address', ''), contact_type['address']),
                    (getattr(fake.unique, 'name', ''), contact_type['name']),
                    (getattr(fake.unique, 'company', ''), contact_type['name']),
                    (getattr(fake.unique, 'iban', ''), contact_type['bank_iban']),  # 'bank-iban' is not a Faker method; it is iban()
                    (getattr(fake.unique, 'swift', ''), contact_type['bank_swift']),
                    (getattr(fake.unique, 'credit_card_number', ''), contact_type['card_number']),
                    (getattr(fake.unique, 'credit_card_expire', ''), contact_type['card_expire']),
                    (getattr(fake.unique, 'credit_card_security_code', ''), contact_type['card_cvv']),
                ]:
                    add_words(word_gen, tag)
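
For reference, a minimal sketch of how the two parallel files can be read back as aligned (tokens, tags) pairs; the helper below is only for illustration and sanity checking, it is not part of the generation code above:

def load_training_pairs(sentences_path=GENERATED_SENTENCES_PATH, tags_path=TAGS_PATH):
    pairs = []
    with open(sentences_path, encoding='utf-8') as sent_f, open(tags_path, encoding='utf-8') as tag_f:
        for sentence_line, tag_line in zip(sent_f, tag_f):
            tokens = sentence_line.split()
            tags = tag_line.split()
            # every token in a line must have exactly one tag on the matching line
            assert len(tokens) == len(tags), (tokens, tags)
            pairs.append((tokens, tags))
    return pairs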

After I've generated the sensitive data and the corresponding tags, I feed them into a TensorFlow model like this:

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    # 36618 is the size of the word vocabulary produced by my text vectorizer
    tf.keras.layers.Embedding(36618, 50),
    tf.keras.layers.Dense(len(tag_vectorizer.get_vocabulary())),
    tf.keras.layers.Dense(len(tag_vectorizer.get_vocabulary())),
    tf.keras.layers.Softmax()
])
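
The training step then looks roughly like this. This is only a sketch under the assumption that word_vectorizer and tag_vectorizer are TextVectorization layers with standardize=None adapted on the generated files, and that sentence_lines and tag_lines are the lines read from those files; the hyperparameters are placeholders:

x = word_vectorizer(sentence_lines)   # (num_lines, max_len) token ids, 36618-word vocabulary
y = tag_vectorizer(tag_lines)         # same shape, one tag id per token

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # Softmax output gives probabilities
              metrics=['accuracy'])
model.fit(x, y, epochs=10, batch_size=64)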

It works more or less fine, with an accuracy of about 0.9 according to TensorFlow's metrics, when I input single words or word combinations that can belong to any of the tag types. Of course it works badly when the prediction input contains words belonging to the 'unrecognized' tag, because I haven't used any of them during training. It also works badly when I feed in a whole sentence: the model tries to give the same tag to all input words.

I also tried to fine-tune the distilbert-base-uncased model from Hugging Face. I hoped that a pretrained model would already have some knowledge about different words by itself, and that I would just have to teach it that every word should get one of the tags from my list, but unfortunately I got the same result as with my minimalistic TensorFlow model.
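
For reference, the usual Hugging Face token-classification recipe looks roughly like the sketch below (illustrative, not my exact code); the key detail is that every subword inherits the tag of its word, while special tokens get the ignore index -100:

from transformers import AutoTokenizer, AutoModelForTokenClassification

label_list = sorted(set(contact_type.values()))
label2id = {label: i for i, label in enumerate(label_list)}

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForTokenClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=len(label_list))

def tokenize_and_align(tokens, word_tags):
    # tokens: ['Phone', 'me', ..., '(694)664-7912'], word_tags: ['O', 'O', ..., 'P']
    encoded = tokenizer(tokens, is_split_into_words=True, truncation=True)
    labels = []
    for word_id in encoded.word_ids():
        if word_id is None:
            labels.append(-100)                          # [CLS] and [SEP] are ignored by the loss
        else:
            labels.append(label2id[word_tags[word_id]])  # each subword inherits its word's tag
    encoded['labels'] = labels
    return encoded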

Finally, my current plan is to improve the situation by generating context words with an LLM. I've chosen Mistral: fortunately it can be deployed for free on Cloudflare, and I can also run it locally on one of my laptops (a sketch of the Cloudflare call is shown after the prompt below). But generating context words is not an easy task either. This is an example of the input that I try to send to the LLM:

We are working on an NER task. The tags are the values of this Python dict,
and the keys are their descriptions:
{
    'unrecognized': 'O',
    'name': 'N',
    'corporate_info': 'C',
    'phone': 'P',
    'email': 'E',
    'url': 'U',
    'address': 'A',
    'bank': 'B',
    'bank_iban': 'B_I',
    'bank_swift': 'B_S',
    'card_number': 'C_N',
    'card_expire': 'C_E',
    'card_cvv': 'C_C',
}

I have generated words that each belong to one of these tags. For example:
(694)664-7912
srajmran@example.net
https://bnw.com/posts/category/blog/homepage.htm
belbky.com
03700 الخلفاوي Ports Apt. 055 New زكيfurt, MT 17598
السيدة دارين بنو الدئل
ابو السعود-الزرقان
(407)335-0361
zalmfty@example.net
http://www.alewalq-jefr.com/search/main/categories/category.php
ykr-mhj.com
USS الجعليين FPO AE 01029
الدكتور عبد الباقي ترابين
السمان-البخاري
001-818-778-8495x075
altwtnjyebd-alhlym@example.org
http://www.bnw.com/register.html
almwrkp.net
3231 راوي Groves بنو صخرtown, PW 47842
نصر الدّين بلي
الكبابيش, القباني and بيروتي
457-350-7524
hyrhrb@example.org
https://www.enzp.biz/category/main/index.php
hmydp.com
01242 حافظ Cliffs South رشدي, ND 92777
سهل مزرعاني
بكر بن عبد مناة, الزماميري and زلاطيمو
265-327-0806
whmydp@example.com

The provided words match these tags, one line of tags per string:
P
E
U
U
A A A A A A A A A A
N N N N
N N
B_S
C_N
C_E
C_C
P
E
U
U
A A A A A A A A A
N N N
N N
B_S
C_N
C_E
C_C
P
E
U
U
A A A A A A A
N N
N N
B_S

Your task is to output, for each provided string, a full sentence in which those generated words can appear. The words
you add shouldn't belong to any of the tags from the dict except 'unrecognized': 'O'. Along with the generated sentences
I expect a list of tags that matches them, which means that all words added by you should get just the 'O' tag.

Example of what I expect from you in the output for the first 4 provided strings:
Phone me here, please, (694)664-7912
We will write a message to your email srajmran@example.net, just wait
Here https://bnw.com/posts/category/blog/homepage.htm you can learn more about your topic
This link belbky.com contains more details of our company, if you are interested

O O O O P
O O O O O O O O E O O
O U O O O O O O O
O O U O O O O O O O O O O

Generate sentences for all 30 strings of provided words, sentences that could occur in real life just as in the example.
Also generate different sentences for the first 4 strings that I used in the example. I expect your output in the same format
as the example: the first list contains the real-life sentences you generated, and the second list, after an empty line, contains the matching tags.

It works really badly. You can try it in ChatGPT yourself.
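
For context, a minimal sketch of how such a prompt could be sent to Mistral on Cloudflare Workers AI; the account ID, API token, and prompt_text variable are placeholders, and the endpoint path and model slug should be double-checked against the current Cloudflare documentation:

import requests

ACCOUNT_ID = '<your-account-id>'   # placeholder
API_TOKEN = '<your-api-token>'     # placeholder
MODEL = '@cf/mistral/mistral-7b-instruct-v0.1'   # verify the slug in the Workers AI catalog

resp = requests.post(
    f'https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}',
    headers={'Authorization': f'Bearer {API_TOKEN}'},
    json={'messages': [{'role': 'user', 'content': prompt_text}]},
)
print(resp.json()['result']['response'])
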
The most successful prompt was this one, but it solves only part of the task:

Create a unique sentence for each of the following lines. Make the sentences short but meaningful.
Lines:
001-988-283-7608
mxtar46@example.net
http://www.bnw.biz/wp-content/faq.html
aljaewny-althan.com
690 تاج الدّين Bypass Suite 014 بنو ياسfurt, MO 81906
الآنسة راما بنو عجل
الشرفاء PLC
SPTDGBIF
3501401856545866
05/27
7900

Maybe I can generate unique sentences myself, label all words as 'unrecognized', and then use Python to replace those 'O' tags whose words and positions match the originally generated entity words. But that doesn't solve the problem that some of the generated context words could themselves belong to one of the non-'O' tags and would still be labeled as 'O'. Also, in my limited experience, the LLM doesn't really understand that it should generate completely different sentences, so there are a lot of near-duplicate contexts that barely differ.
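
To make that idea concrete, the replacement step could be a sketch like this: tag every token 'O' first, then overwrite the tags of the token span that exactly matches the original entity string (the helper name and the matching strategy here are mine, just for illustration):

def tag_sentence(sentence, entity_value, entity_tag):
    tokens = sentence.split()
    tags = ['O'] * len(tokens)
    entity_tokens = entity_value.split()
    # look for the entity's tokens as a contiguous span inside the sentence
    for start in range(len(tokens) - len(entity_tokens) + 1):
        if tokens[start:start + len(entity_tokens)] == entity_tokens:
            for i in range(start, start + len(entity_tokens)):
                tags[i] = entity_tag
            break
    return tokens, tags

tokens, tags = tag_sentence('Phone me here, please, (694)664-7912', '(694)664-7912', 'P')
# tokens -> ['Phone', 'me', 'here,', 'please,', '(694)664-7912']
# tags   -> ['O', 'O', 'O', 'O', 'P']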

So, could you please help me with any ideas on how to improve the prompt to the LLM, or maybe suggest a change of approach to solve this task?