Eponec - A grammar and programming tool for guiding LLMs

chumpro · June 26, 2024, 3:25am

Hi, I am working on a project where I am creating a parser that can be used to guide LLMs.

Here is a link if you like to check it out:

I have two, possibly related, problems. I am using huggingface transformers and TinyLlama/TinyLlama_v1.1. I’m on a laptop with no GPU.

Problem 1: I want a dictionary of token_string → token_id, where the tokenizer’s special characters for space and newline in token_string are translated into the ordinary ' ' and '\n'. I’m doing regex matching against the token strings. Currently I am achieving this with the following function:

def fix_token( s ) :
	return s.replace( chr( 9601 ), ' ' ).replace( '<0x0A>', '\n' )

I’m thinking there is a better way to achieve this by using the tokenizer.
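For reference, here is a minimal sketch of how the dictionary could be built with this kind of function. It assumes the vocabulary comes from tokenizer.get_vocab() (a token_string → token_id dict) and that byte-fallback tokens look like <0xNN>, as <0x0A> does above. build_token_table and toy_vocab are hypothetical names used only for illustration; the toy entries stand in for the real vocabulary.

```python
import re

BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")
SPACE_MARK = chr(9601)  # the "▁" character the tokenizer uses for a space


def fix_token(s):
    # Byte-fallback tokens such as <0x0A>: map ASCII bytes to their character.
    m = BYTE_TOKEN.fullmatch(s)
    if m:
        value = int(m.group(1), 16)
        # Bytes >= 0x80 are fragments of multi-byte UTF-8 characters and
        # cannot be decoded on their own, so they are left unchanged.
        return chr(value) if value < 0x80 else s
    return s.replace(SPACE_MARK, " ")


def build_token_table(vocab):
    # vocab: token_string -> token_id, e.g. the result of tokenizer.get_vocab().
    # Different raw tokens could end up with the same fixed string,
    # so each fixed string maps to a list of ids.
    table = {}
    for token, token_id in vocab.items():
        table.setdefault(fix_token(token), []).append(token_id)
    return table


# Toy vocabulary (assumed for illustration, not the full model vocabulary).
toy_vocab = {"▁terminate": 29504, "<0x0A>": 13, "hi": 2918, "▁": 29871}
table = build_token_table(toy_vocab)
print(table)
```

With the real tokenizer, the call would presumably be build_token_table( tokenizer.get_vocab() ).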

I’ve tried the following:

s1 = tokenizer.convert_ids_to_tokens( [ 29504 ] )

print( 's1', s1 )

s2 = tokenizer.convert_tokens_to_string( s1 )

print( 's2', [ s2 ] )

s3 = tokenizer.decode( [ 29504 ] )

print( 's3', [ s3 ] )

which outputs:

s1 ['▁terminate']
s2 ['terminate']
s3 ['terminate']

Strangely convert_tokens_to_string handles newline “correctly”. The output I am hoping for starts with an ordinary space like this:

[' terminate']
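One workaround that might produce the leading space is to decode the token after an anchor token and cut the anchor’s text off the front. This is only a sketch under the assumption that the decoder drops the space only at the very start of the sequence; I have not verified it across tokenizers, and decode options such as cleanup of tokenization spaces could change the result. token_text, toy_decode and TOY_PIECES are hypothetical names; toy_decode just imitates the behaviour shown in the printout above.

```python
def token_text(decode, token_id, anchor_id):
    # Decode the anchor alone, then the anchor followed by the token.
    # Whatever comes after the anchor's text is the token's own text,
    # including a leading space if the decoder kept it.
    prefix = decode([anchor_id])
    both = decode([anchor_id, token_id])
    if not both.startswith(prefix):
        raise ValueError("anchor text is not a prefix of the combined decode")
    return both[len(prefix):]


# Toy stand-in for tokenizer.decode (assumed behaviour: the space marker
# becomes ' ', and one leading space of the whole sequence is dropped).
TOY_PIECES = {2918: "hi", 29504: "▁terminate"}


def toy_decode(ids):
    text = "".join(TOY_PIECES[i] for i in ids).replace("▁", " ")
    return text[1:] if text.startswith(" ") else text


print([toy_decode([29504])])
print([token_text(toy_decode, 29504, 2918)])
```

With the real tokenizer, tokenizer.decode would be passed in place of toy_decode, with any anchor id whose own text does not start with a space.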

Problem 2: The second problem is that '</s>' sometimes encodes to something that decodes to '</s> ', that is, a trailing space is added.
Notice that I am decoding the output_ids of model.generate (to use for regex pattern matching) and then encoding the result again so that
the model gets the expected tokenization of the string.

I have the following (the text prompt is shown in quotes and input_ids is its encoding):

[' <|system|>\n']
input_ids tensor([29871,   529, 29989,  5205, 29989, 29958,    13])

[' <|system|>\nYou are an AI assistant.']
input_ids tensor([29871,   529, 29989,  5205, 29989, 29958,    13,  3492,   526,   385,
	      319, 29902, 20255, 29889])

[' <|system|>\nYou are an AI assistant.</s>']
input_ids tensor([29871,   529, 29989,  5205, 29989, 29958,    13,  3492,   526,   385,
	      319, 29902, 20255, 29889,     2])

[' <|system|>\nYou are an AI assistant.</s><|user|>\n']
input_ids tensor([29871,   529, 29989,  5205, 29989, 29958,    13,  3492,   526,   385,
	      319, 29902, 20255, 29889,     2,   529, 29989,  1792, 29989, 29958,
	       13])
> hi

[' <|system|>\nYou are an AI assistant.</s><|user|>\nhi']
input_ids tensor([29871,   529, 29989,  5205, 29989, 29958,    13,  3492,   526,   385,
	      319, 29902, 20255, 29889,     2,   529, 29989,  1792, 29989, 29958,
	       13,  2918])

[' <|system|>\nYou are an AI assistant.</s><|user|>\nhi<|assistant|>\n']
input_ids tensor([29871,   529, 29989,  5205, 29989, 29958,    13,  3492,   526,   385,
	      319, 29902, 20255, 29889,     2,   529, 29989,  1792, 29989, 29958,
	       13,  2918, 29966, 29989,   465, 22137, 29989, 29958,    13])

[' <|system|>\nYou are an AI assistant.</s> <|user|>\nhi<|assistant|>\ni']
input_ids tensor([29871,   529, 29989,  5205, 29989, 29958,    13,  3492,   526,   385,
	      319, 29902, 20255, 29889,     2, 29871,   529, 29989,  1792, 29989,
	    29958,    13,  2918, 29966, 29989,   465, 22137, 29989, 29958,    13,
	    29875])

Notice that '</s>' has token id 2. Up until the last printout, the 2 is followed by 529, the beginning of the sequence '<|user|>\n'. Then, all of a sudden, at the last step a 29871 is inserted between the 2 and the 529. The code running here is eponec.Parser.generate with an extra printout of input_ids. What I’m trying to do is make an eponec.Parser that does chat correctly according to the chat template of the model. The extra space is causing problems for the parser, which relies on offsets.

I have a feeling I’m not doing this the right way. My question is: what is happening, and how can I stop it from happening?
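One direction that might avoid the decode and re-encode round trip is to keep the ids from model.generate as the source of truth and build the text for regex matching from per-token strings, so the ids are never re-tokenized and every token has a known character offset. The sketch below is not how eponec works today; TokenBuffer is a hypothetical name, and the id → string table is a toy one whose entries are assumed from the printouts above.

```python
class TokenBuffer:
    """Keeps token ids, the matching text, and each token's start offset."""

    def __init__(self, text_of):
        self.text_of = text_of  # token_id -> fixed token string
        self.ids = []
        self.offsets = []       # offsets[k] is where ids[k] starts in text
        self.text = ""

    def extend(self, new_ids):
        # Append ids one by one; the text grows by exactly each token's
        # string, so earlier offsets never shift.
        for token_id in new_ids:
            self.offsets.append(len(self.text))
            self.ids.append(token_id)
            self.text += self.text_of[token_id]


# Toy table (assumed for illustration only).
toy_text_of = {2: "</s>", 529: " <", 29989: "|", 1792: "user",
               29958: ">", 13: "\n"}

buf = TokenBuffer(toy_text_of)
buf.extend([2, 529, 29989, 1792, 29989, 29958, 13])
print([buf.text])
print(buf.ids)
print(buf.offsets)
```

The buffer’s ids could then be fed back to the model directly, while the regex matching runs on buf.text.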
