Build_dataset.tokenize logic could be simpler

Hi, the lab for week 3 has a build_dataset function with a nested tokenize function that sets two dictionary values on sample, "input_ids" and "query". As it's currently written, the query slot is defined like this:

        sample["query"] = tokenizer.decode(sample["input_ids"])

But this is just undoing the tokenization work done on the prompt variable from the line above. A simpler assignment could look like this:

        sample["query"] = prompt

It might make even more sense if you reverse the order of assignments to the sample slots like this:

        sample["query"] = prompt
        sample["input_ids"] = tokenizer.encode(prompt)
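
To make the comparison concrete, here is a minimal sketch of both versions side by side. It does not use the lab's real tokenizer: ToyTokenizer, the "text" key, and the appended end-of-sequence marker are all assumptions made up for illustration. One thing it shows is that a decode of the encoded ids is not always identical to the original prompt, for example when encode adds special tokens, so it may be worth checking whether anything downstream relies on that difference before swapping the assignment.

```python
class ToyTokenizer:
    """Hypothetical stand-in for the lab's tokenizer (not a real library class)."""

    EOS = "</s>"

    def __init__(self):
        self.token_to_id = {self.EOS: 0}
        self.id_to_token = {0: self.EOS}

    def encode(self, text):
        ids = []
        for token in text.split():
            if token not in self.token_to_id:
                new_id = len(self.token_to_id)
                self.token_to_id[token] = new_id
                self.id_to_token[new_id] = token
            ids.append(self.token_to_id[token])
        # Assumption: encode appends an end-of-sequence id, as some tokenizers do.
        ids.append(self.token_to_id[self.EOS])
        return ids

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


tokenizer = ToyTokenizer()


def tokenize_round_trip(sample):
    # Version as described in the post: encode, then decode back into "query".
    prompt = sample["text"]  # hypothetical source of the prompt
    sample["input_ids"] = tokenizer.encode(prompt)
    sample["query"] = tokenizer.decode(sample["input_ids"])
    return sample


def tokenize_direct(sample):
    # Suggested version: keep the prompt as the query, then encode it.
    prompt = sample["text"]  # hypothetical source of the prompt
    sample["query"] = prompt
    sample["input_ids"] = tokenizer.encode(prompt)
    return sample


round_trip = tokenize_round_trip({"text": "Summarize this dialogue"})
direct = tokenize_direct({"text": "Summarize this dialogue"})

print(round_trip["query"])
print(direct["query"])
print(round_trip["input_ids"] == direct["input_ids"])
```

With this toy tokenizer both versions produce the same input_ids, and only the query text differs, because the round-trip version picks up the end-of-sequence marker.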

Chris

Thanks for your suggestion.