Hi, I am working on a project where I am creating a parser that can be used to guide LLMs.
Here is a link if you'd like to check it out:
I have two, possibly related, problems. I am using huggingface transformers and TinyLlama/TinyLlama_v1.1. I’m on a laptop with no GPU.
Problem 1: I want to have a dictionary of token_string → token_id, where the tokenizer's special characters for space and newline are translated into the ordinary ' ' and '\n' in token_string. I'm doing regex matching against the token strings. Currently I am achieving this with the following function:
def fix_token(s):
    # chr(9601) is '▁', the tokenizer's marker for a space
    return s.replace(chr(9601), ' ').replace('<0x0A>', '\n')
I’m thinking there is a better way to achieve this by using the tokenizer.
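For reference, here is a minimal sketch of how the whole dictionary could be built from the vocabulary. It assumes the vocabulary is a plain dict of the kind returned by `tokenizer.get_vocab()`. The names `build_token_map` and `BYTE_TOKEN` are made up for illustration. It also generalizes `fix_token` to any ASCII byte token of the form `<0xNN>`, which is an assumption about the vocabulary layout rather than something I have verified for every token. Several tokens can end up with the same fixed string, so the sketch maps each string to a list of ids.

```python
import re

SPIECE_UNDERLINE = chr(9601)  # '▁', the tokenizer's marker for a space
BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")  # byte tokens such as '<0x0A>'


def fix_token(s):
    """Translate the tokenizer's special markers into ordinary characters."""
    match = BYTE_TOKEN.fullmatch(s)
    if match:
        value = int(match.group(1), 16)
        # Only ASCII bytes form a complete character on their own; higher
        # bytes are fragments of multi-byte UTF-8 sequences and are left as-is.
        return chr(value) if value < 0x80 else s
    return s.replace(SPIECE_UNDERLINE, " ")


def build_token_map(vocab):
    """Map each fixed token string to the list of token ids that produce it."""
    token_map = {}
    for token, token_id in vocab.items():
        token_map.setdefault(fix_token(token), []).append(token_id)
    return token_map


# Toy vocabulary standing in for tokenizer.get_vocab(); apart from 29504,
# the ids are illustrative only.
toy_vocab = {"▁terminate": 29504, "<0x0A>": 13, "▁": 1, "<0x20>": 2}
token_map = build_token_map(toy_vocab)
print(token_map)
```

With the toy vocabulary, both '▁' and '<0x20>' collapse to a single space, which is why the values are lists rather than single ids.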
I’ve tried the following:
s1 = tokenizer.convert_ids_to_tokens( [ 29504 ] )
print( 's1', s1 )
s2 = tokenizer.convert_tokens_to_string( s1 )
print( 's2', [ s2 ] )
s3 = tokenizer.decode( [ 29504 ] )
print( 's3', [ s3 ] )
which outputs:
s1 ['▁terminate']
s2 ['terminate']
s3 ['terminate']
Strangely, convert_tokens_to_string handles newline “correctly”. The output I am hoping for starts with an ordinary space, like this:
[' terminate']
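One workaround I am aware of is to let the tokenizer convert the token and then put the leading space back by hand. The sketch below assumes that the conversion drops exactly one leading space, as seen in s2 above. `token_to_text` and `fake_to_string` are hypothetical names, and `fake_to_string` is only a stand-in that mimics the observed behaviour. With the real tokenizer, the callable would be something like `lambda t: tokenizer.convert_tokens_to_string([t])`.

```python
SPIECE_UNDERLINE = chr(9601)  # '▁', the tokenizer's marker for a space


def token_to_text(token, to_string):
    """Convert one vocabulary token to text, restoring a dropped leading space.

    `to_string` is assumed to behave like
    `lambda t: tokenizer.convert_tokens_to_string([t])`.
    """
    text = to_string(token)
    if token.startswith(SPIECE_UNDERLINE):
        # Assumption: the conversion removed exactly one leading space.
        return " " + text
    return text


def fake_to_string(token):
    """Stand-in for the tokenizer call; mimics the behaviour seen in s2."""
    text = token.replace(SPIECE_UNDERLINE, " ")
    return text[1:] if text.startswith(" ") else text


print([token_to_text("▁terminate", fake_to_string)])  # [' terminate']
print([token_to_text("hi", fake_to_string)])          # ['hi']
```

This is still a manual fix layered on top of the tokenizer, so it does not really answer whether the tokenizer itself can produce the string with the space intact.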
Problem 2: The second problem is that '</s>' sometimes encodes to something that decodes to '</s> ', that is, with a trailing space added.
Note that I decode the output_ids of model.generate (to use for regex pattern matching) and then encode the result again, so that the model gets the expected tokenization of the string.
I have the following (the quoted string is the text prompt and input_ids is its encoding):
[' <|system|>\n']
input_ids tensor([29871, 529, 29989, 5205, 29989, 29958, 13])
[' <|system|>\nYou are an AI assistant.']
input_ids tensor([29871, 529, 29989, 5205, 29989, 29958, 13, 3492, 526, 385,
319, 29902, 20255, 29889])
[' <|system|>\nYou are an AI assistant.</s>']
input_ids tensor([29871, 529, 29989, 5205, 29989, 29958, 13, 3492, 526, 385,
319, 29902, 20255, 29889, 2])
[' <|system|>\nYou are an AI assistant.</s><|user|>\n']
input_ids tensor([29871, 529, 29989, 5205, 29989, 29958, 13, 3492, 526, 385,
319, 29902, 20255, 29889, 2, 529, 29989, 1792, 29989, 29958,
13])
> hi
[' <|system|>\nYou are an AI assistant.</s><|user|>\nhi']
input_ids tensor([29871, 529, 29989, 5205, 29989, 29958, 13, 3492, 526, 385,
319, 29902, 20255, 29889, 2, 529, 29989, 1792, 29989, 29958,
13, 2918])
[' <|system|>\nYou are an AI assistant.</s><|user|>\nhi<|assistant|>\n']
input_ids tensor([29871, 529, 29989, 5205, 29989, 29958, 13, 3492, 526, 385,
319, 29902, 20255, 29889, 2, 529, 29989, 1792, 29989, 29958,
13, 2918, 29966, 29989, 465, 22137, 29989, 29958, 13])
[' <|system|>\nYou are an AI assistant.</s> <|user|>\nhi<|assistant|>\ni']
input_ids tensor([29871, 529, 29989, 5205, 29989, 29958, 13, 3492, 526, 385,
319, 29902, 20255, 29889, 2, 29871, 529, 29989, 1792, 29989,
29958, 13, 2918, 29966, 29989, 465, 22137, 29989, 29958, 13,
29875])
Notice that '</s>' has token id 2. Up until the last printout, the 2 is followed by 529, the beginning of the sequence '<|user|>\n'. Then, all of a sudden, at the last step a 29871 is inserted between the 2 and the 529. The code running here is eponec.Parser.generate with an extra printout of input_ids. What I’m trying to do is make an eponec.Parser that does chat correctly according to the chat template of the model. The extra space is causing problems for the parser, which relies on offsets.
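To pin down exactly where the tokenization changes, the last two printouts can be compared programmatically. This is a small diagnostic sketch. The helper `first_divergence` is a name made up here, and the id lists are copied from the last two printouts above.

```python
def first_divergence(previous_ids, current_ids):
    """Return the first index where the two id lists differ.

    Returns None when previous_ids is a prefix of current_ids, which is
    what is expected if re-encoding only appends new tokens.
    """
    for index, (prev, curr) in enumerate(zip(previous_ids, current_ids)):
        if prev != curr:
            return index
    if len(previous_ids) > len(current_ids):
        return len(current_ids)
    return None


# input_ids from the second-to-last and last printouts above
previous_ids = [29871, 529, 29989, 5205, 29989, 29958, 13, 3492, 526, 385,
                319, 29902, 20255, 29889, 2, 529, 29989, 1792, 29989, 29958,
                13, 2918, 29966, 29989, 465, 22137, 29989, 29958, 13]
current_ids = [29871, 529, 29989, 5205, 29989, 29958, 13, 3492, 526, 385,
               319, 29902, 20255, 29889, 2, 29871, 529, 29989, 1792, 29989,
               29958, 13, 2918, 29966, 29989, 465, 22137, 29989, 29958, 13,
               29875]

index = first_divergence(previous_ids, current_ids)
print(index, previous_ids[index], current_ids[index])  # 15 529 29871
```

The lists agree up to and including the 2 at index 14 and differ at index 15, where the 29871 appears in place of the 529.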
I have a feeling I’m not doing this the right way. My question is: what is happening, and how do I stop it from happening?