Hi, I am a bit confused by the tokenizer function defined in the lab. If I test it on a single element (and remove the actual tokenization), I get:
(sorry the tabs are disappearing when posting the code here…)
def my_tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    return prompt

my_tokenize_function(dataset['train'][0])
['Summarize the following conversation.\n\n#\n\nSummary: ',
 'Summarize the following conversation.\n\nP\n\nSummary: ',
 'Summarize the following conversation.\n\ne\n\nSummary: ',
 'Summarize the following conversation.\n\nr\n\nSummary: ',
 'Summarize the following conversation.\n\ns\n\nSummary: ',
 'Summarize the following conversation.\n\no\n\nSummary: ',
 'Summarize the following conversation.\n\nn\n\nSummary: ',
 'Summarize the following conversation.\n\n1\n\nSummary: ',
 'Summarize the following conversation.\n\n#\n\nSummary: ',
 'Summarize the following conversation.\n\n:\n\nSummary: ',
 'Summarize the following conversation.\n\n \n\nSummary: ',
 …
]
Basically, it splits the example letter by letter: when a single element is passed, example["dialogue"] is one string, so the list comprehension iterates over its characters. I don't think this is the expected behavior.
Can someone double-check it?
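To show the difference I mean, here is a minimal check (the dataset name below is just my guess at what the lab loads, so substitute whatever load_dataset call the notebook actually uses). Indexing a single row gives example["dialogue"] as one string, while a batch-style slice gives a list of strings, and the comprehension only iterates over whole dialogues in the second case:

from datasets import load_dataset

# NOTE: dataset name is an assumption for illustration; replace with the lab's dataset.
dataset = load_dataset("knkarthick/dialogsum")

def my_tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    return [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]

# Single row: example["dialogue"] is one string, so the comprehension
# iterates over its characters (the letter-by-letter output above).
print(my_tokenize_function(dataset['train'][0])[:3])

# Batch-style slice: example["dialogue"] is a list with one string, so the
# comprehension builds one full prompt per dialogue.
print(my_tokenize_function(dataset['train'][0:1]))

If that is right, the function only behaves as intended when each column comes in as a list of values, which is presumably how dataset.map(..., batched=True) calls it in the lab, and calling it on a single element like I did is just outside its expected input.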