Hi, I am a bit confused by the tokenizer function defined in the lab. If I test it on a single element (and remove the actual tokenization), I get:
(sorry the tabs are disappearing when posting the code here…)
def my_tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    return prompt

my_tokenize_function(dataset['train'][0])
['Summarize the following conversation.\n\n#\n\nSummary: ',
 'Summarize the following conversation.\n\nP\n\nSummary: ',
 'Summarize the following conversation.\n\ne\n\nSummary: ',
 'Summarize the following conversation.\n\nr\n\nSummary: ',
 'Summarize the following conversation.\n\ns\n\nSummary: ',
 'Summarize the following conversation.\n\no\n\nSummary: ',
 'Summarize the following conversation.\n\nn\n\nSummary: ',
 'Summarize the following conversation.\n\n1\n\nSummary: ',
 'Summarize the following conversation.\n\n#\n\nSummary: ',
 'Summarize the following conversation.\n\n:\n\nSummary: ',
 'Summarize the following conversation.\n\n \n\nSummary: ',
 …
]
Basically, it splits the example letter by letter: when a single element is passed, example["dialogue"] is one string, so the list comprehension iterates over its characters. I don't think this is the expected behavior.
Can someone double-check it?
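To show the difference I mean, here is a minimal check (the dataset name below is just my guess at what the lab loads, so substitute whatever load_dataset call the notebook actually uses). Indexing a single row gives example["dialogue"] as one string, while a batch-style slice gives a list of strings, and the comprehension only iterates over whole dialogues in the second case:

from datasets import load_dataset

# NOTE: dataset name is an assumption for illustration; replace with the lab's dataset.
dataset = load_dataset("knkarthick/dialogsum")

def my_tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    return [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]

# Single row: example["dialogue"] is one string, so the comprehension
# iterates over its characters (the letter-by-letter output above).
print(my_tokenize_function(dataset['train'][0])[:3])

# Batch-style slice: example["dialogue"] is a list with one string, so the
# comprehension builds one full prompt per dialogue.
print(my_tokenize_function(dataset['train'][0:1]))

If that is right, the function only behaves as intended when each column comes in as a list of values, which is presumably how dataset.map(..., batched=True) calls it in the lab, and calling it on a single element like I did is just outside its expected input.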