C4W3 SentencePiece and BPE Lab Lossless Tokenization


I’m reading about lossless tokenization in the SentencePiece and BPE Lab, and I’m wondering whether there’s something wrong with the two code snippets provided. Between the two snippets is the text:

Reversing the order of the second and third operations…

that implies that the 2nd and 3rd operations interact with each other. But in both snippets, normalize() operates on the raw string ‘Tokenization is hard.’ instead of the variable s, so its output is unrelated to the replace() call, and the order of the two operations doesn’t seem to matter. I’m a bit confused about what I’m supposed to learn from these two snippets.

To be more explicit here, this is what the 2nd code snippet looks like:

s = 'Tokenization is hard.'
sn = normalize('NFKC', 'Tokenization is hard.')
sn_ = s.replace(' ', '\u2581')

But I’d expected instead to see code where the 2nd and 3rd operations are related, i.e. this:

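from unicodedata import normalize  # normalize() is assumed to be unicodedata.normalize

s = 'Tokenization is hard.'
sn = normalize('NFKC', s)            # normalize the variable, not a separate literal
sn_ = sn.replace(' ', '\u2581')      # then mark spaces with U+2581 on the normalized string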
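To check whether the order would matter once the two operations are actually chained, here is a minimal sketch. It assumes normalize is unicodedata.normalize from the standard library; the variable names and the second test string (with a no-break space, U+00A0) are my own and are not from the lab.

```python
from unicodedata import normalize

s = 'Tokenization is hard.'

# Chained version: normalize first, then mark spaces with U+2581.
normalize_then_replace = normalize('NFKC', s).replace(' ', '\u2581')

# Reversed order: mark spaces first, then normalize.
replace_then_normalize = normalize('NFKC', s.replace(' ', '\u2581'))

# For this plain ASCII sentence, both orders give the same string.
same_for_ascii = normalize_then_replace == replace_then_normalize

# Hypothetical input (not from the lab): a no-break space (U+00A0),
# which NFKC maps to a regular space.
t = 'Tokenization\u00a0is hard.'
t_normalize_first = normalize('NFKC', t).replace(' ', '\u2581')
t_replace_first = normalize('NFKC', t.replace(' ', '\u2581'))

# Undoing the space marker recovers the original ASCII sentence.
restored = normalize_then_replace.replace('\u2581', ' ')

print(same_for_ascii)
print(repr(t_normalize_first))
print(repr(t_replace_first))
print(restored == s)
```

With the ASCII sentence the two orders agree, which may be why the snippets look interchangeable. With the no-break space they differ: normalizing first turns U+00A0 into a regular space that then gets marked, whereas replacing first leaves it as an unmarked space after normalization.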

Let me know what you think!

Hi, the code does look confusing and does not serve its purpose well. We are reviewing this section for improvements.
In the meantime, you can read about lossless tokenization in the following resources: blog post, paper