Tokenizer Error with batched=True When Using a Different Cloud Service
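
For context, the snippet below assumes a seq2seq tokenizer and a dataset whose rows have "dialogue" and "summary" columns. The checkpoint and dataset names here are placeholders, not taken from the original post; substitute whatever you are actually using:

from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder identifiers for illustration only.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
dataset = load_dataset("knkarthick/dialogsum")  # splits: train, validation, test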

def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    # With batched=True, example["dialogue"] is a list of strings, one per row.
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    # return_tensors="pt" makes the tokenizer return torch tensors.
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset actually contains three different splits: train, validation, and test.
# With batched=True, tokenize_function is applied to every split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)

TypeError: Provided function which is applied to all elements of table returns a dict of types [<class 'list'>,
<class 'list'>, <class 'list'>, <class 'list'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>]. When using
batched=True, make sure provided function returns a dict of types like (<class 'list'>, <class 'numpy.ndarray'>).

The problem is the return_tensors="pt" argument: with batched=True, Dataset.map writes the returned columns into an Arrow table, which accepts Python lists and NumPy arrays but not torch tensors. Converting each tensor with .numpy() resolves the error. Use this instead:

def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    # Convert the torch tensors to NumPy arrays, which Dataset.map can serialize.
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids.numpy()
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids.numpy()

    return example
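
Alternatively, you can avoid the round trip through PyTorch entirely: without return_tensors, the tokenizer returns plain Python lists, which Arrow serializes directly, and you can switch the dataset to torch format afterwards. A minimal sketch along those lines, using the same placeholder names as above:

def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    # No return_tensors: the tokenizer yields lists of ints, which Arrow accepts as-is.
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True).input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True).input_ids
    return example

tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Convert to torch tensors only when rows are read, e.g. by a DataLoader.
tokenized_datasets.set_format("torch", columns=["input_ids", "labels"])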