ValueError: Couldn't instantiate the backend tokenizer from one of:

Hi, I need help. When I execute this code snippet:

from utils import get_sentence_window_query_engine

sentence_window_engine = get_sentence_window_query_engine(sentence_index)

I got this error:

---------------------------------------------------------------------------

ValueError Traceback (most recent call last)

in <cell line: 3>()
      1 from utils import get_sentence_window_query_engine
      2 
----> 3 sentence_window_engine = get_sentence_window_query_engine(sentence_index)


7 frames

/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
    118             fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
    119         else:
--> 120             raise ValueError(
    121                 "Couldn't instantiate the backend tokenizer from one of: \n"
    122                 "(1) a tokenizers library serialization file, \n"

ValueError: Couldn't instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

I even installed sentencepiece, but I still get the error above. I am using Google Colab. Thanks
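For reference, I installed it in a Colab cell essentially like this; do I also need to restart the runtime afterwards, since transformers seems to decide at import time whether sentencepiece is available?

!pip install sentencepiece
# a runtime restart (Runtime > Restart runtime) may be needed so
# transformers re-checks for the package before re-running the cell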

I'd try upgrading your Python interpreter to version 3.11 and see if that helps.
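You can confirm which interpreter your runtime is actually using with a quick check:

import sys
print(sys.version)  # shows the interpreter version the Colab runtime is on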

Hi Andrew,
Thank you for your response. Yes, that helped. But now I get another error in the L3 and L4 notebooks, and it is the same error in both. Not sure whether you can help. Here is the code:

tru_recorder_3 = get_prebuilt_trulens_recorder(
    sentence_window_engine_3,
    app_id='sentence window engine 3'
)

and the error:

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in <cell line: 1>()
----> 1 tru_recorder_3 = get_prebuilt_trulens_recorder(
      2     sentence_window_engine_3,
      3     app_id='sentence window engine 3'
      4 )

/content/utils.py in get_prebuilt_trulens_recorder(query_engine, app_id)
     64 
     65 def get_prebuilt_trulens_recorder(query_engine, app_id):
---> 66     tru_recorder = TruLlama(
     67         query_engine,
     68         app_id=app_id,

/usr/local/lib/python3.10/dist-packages/trulens_eval/tru_llama.py in __init__(self, app, **kwargs)
    248         super().__init__(**kwargs)
    249 
--> 250         self.post_init()
    251 
    252     @classmethod

/usr/local/lib/python3.10/dist-packages/trulens_eval/app.py in post_init(self)
    484 
    485         for f in self.feedbacks:
--> 486             self.db.insert_feedback_definition(f)
    487 
    488         else:

/usr/local/lib/python3.10/dist-packages/trulens_eval/database/utils.py in wrapper(*args, **kwargs)
     59     def wrapper(*args, **kwargs):
     60         callback(*args, **kwargs)
---> 61         return func(*args, **kwargs)
     62 
     63     return wrapper

/usr/local/lib/python3.10/dist-packages/trulens_eval/database/sqlalchemy_db.py in insert_feedback_definition(self, feedback_definition)
    193             .filter_by(feedback_definition_id=feedback_definition.feedback_definition_id)
    194             .first():
--> 195             _fb_def.app_json = feedback_definition.json()
    196         else:
    197             _fb_def = orm.FeedbackDefinition.parse(

/usr/local/lib/python3.10/dist-packages/pydantic/main.cpython-310-x86_64-linux-gnu.so in pydantic.main.BaseModel.json()

/usr/lib/python3.10/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    236         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237         separators=separators, default=default, sort_keys=sort_keys,
--> 238         **kw).encode(obj)
    239 
    240 

/usr/lib/python3.10/json/encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

/usr/lib/python3.10/json/encoder.py in iterencode(self, o, _one_shot)
    255             self.key_separator, self.item_separator, self.sort_keys,
    256             self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258 
    259     def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

/usr/local/lib/python3.10/dist-packages/pydantic/json.cpython-310-x86_64-linux-gnu.so in pydantic.json.pydantic_encoder()

TypeError: Object of type 'OpenAI' is not JSON serializable

If I execute the first code snippet, there is no issue:

tru_recorder_1 = get_prebuilt_trulens_recorder(
    sentence_window_engine_1,
    app_id='sentence window engine 1'
)

But if I then execute the second snippet, the error occurs, in both the L3 and L4 notebooks.
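Would resetting the TruLens database before building the next recorder avoid this? I'm assuming the failure only happens on the update path, i.e. when the same feedback definitions are already stored from the first recorder. A rough sketch of what I mean (note this wipes everything recorded so far):

from trulens_eval import Tru

# Reset the TruLens database so the next recorder inserts fresh
# feedback definitions instead of updating existing ones (the update
# path is what calls .json() on the OpenAI provider and fails).
# Warning: this deletes all previously recorded apps and feedback.
tru = Tru()
tru.reset_database()

tru_recorder_3 = get_prebuilt_trulens_recorder(
    sentence_window_engine_3,
    app_id='sentence window engine 3'
)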

Yes, I see the same error. I have logged an issue with TruLens: TypeError: Object of type 'OpenAI' is not JSON serializable · Issue #646 · truera/trulens · GitHub
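Once a fix lands upstream, it may also be worth upgrading trulens_eval in a Colab cell, e.g.:

!pip install -U trulens-eval
# restart the runtime afterwards so the upgraded package is picked up

Whether a given release actually contains the fix is something to confirm on the issue itself.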