[SOLVED] Potential issue with tokenize_function in week2 lab

Hi, I am a bit confused by the tokenize function defined in the lab. If I test it on a single element (with the actual tokenization removed), I get:

def my_tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    # Build one prompt per item of example["dialogue"]
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    return prompt

my_tokenize_function(dataset['train'][0])

['Summarize the following conversation.\n\n#\n\nSummary: ',
'Summarize the following conversation.\n\nP\n\nSummary: ',
'Summarize the following conversation.\n\ne\n\nSummary: ',
'Summarize the following conversation.\n\nr\n\nSummary: ',
'Summarize the following conversation.\n\ns\n\nSummary: ',
'Summarize the following conversation.\n\no\n\nSummary: ',
'Summarize the following conversation.\n\nn\n\nSummary: ',
'Summarize the following conversation.\n\n1\n\nSummary: ',
'Summarize the following conversation.\n\n#\n\nSummary: ',
'Summarize the following conversation.\n\n:\n\nSummary: ',
'Summarize the following conversation.\n\n \n\nSummary: ',

]

Basically, it is splitting the example character by character, because the list comprehension iterates over a string. I don't think this is the expected behavior.
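For anyone who wants to reproduce this without loading the dataset, here is a minimal sketch in plain Python. The names build_prompts, single_row and batch are made up for illustration, and the sample dialogues are invented stand-ins for the lab data, not actual rows.

```python
def build_prompts(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    # The comprehension iterates over whatever example["dialogue"] is
    return [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]

# Hypothetical single row: "dialogue" is one string
single_row = {"dialogue": "#Person1#: Hi"}

# Hypothetical batch: "dialogue" is a list of strings
batch = {"dialogue": ["#Person1#: Hi", "#Person2#: Hello"]}

per_char = build_prompts(single_row)   # one prompt per character of the string
per_dialogue = build_prompts(batch)    # one prompt per dialogue in the list

print(len(per_char), len(per_dialogue))
```

Iterating over a string yields its characters, so the single row produces as many prompts as the dialogue has characters, while the batch produces one prompt per dialogue.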

Can someone double-check it?

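OK, after fiddling with it a bit, I realised that when tokenize_function is called on the entire dataset, the input "example" is actually a dictionary of lists. So example["dialogue"] is a list of dialogue strings, and the comprehension iterates over whole dialogues rather than characters.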

I still think the function is not quite clear: the input variable should at least be called "examples".
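As a rough sketch of what I mean (not the lab's actual code), the function could take a plural argument name, and a small helper could wrap a single row into a one-element batch for testing. The names build_prompts and as_batch are hypothetical, and the sample row is invented.

```python
def build_prompts(examples):
    """Expects a batch: a dict mapping column names to lists of values."""
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    return [start_prompt + dialogue + end_prompt for dialogue in examples["dialogue"]]

def as_batch(row):
    """Wrap each field of a single row in a one-element list (hypothetical helper)."""
    return {key: [value] for key, value in row.items()}

# Invented stand-in for one dataset row
row = {"dialogue": "#Person1#: Hi", "summary": "A greeting."}

prompts = build_prompts(as_batch(row))
print(prompts)
```

With the Hugging Face datasets library, slicing such as dataset['train'][0:1] should give the same dictionary-of-lists shape directly, so the helper is only needed when starting from a single row.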