C4W3 SentencePiece and BPE Lab Lossless Tokenization

esav · December 21, 2023, 10:47pm

Hi,

I’m reading about lossless tokenization in the SentencePiece and BPE Lab, and I’m wondering whether there’s something wrong with the two code snippets provided. Between the two snippets is this text:

Reversing the order of the second and third operations…

That implies that the 2nd and 3rd operations interact with each other. But in both snippets, normalize() operates on the raw string ‘Tokenization is hard.’ rather than on the variable s, so its output is unrelated to the replace() call and the order of the two operations doesn’t seem to matter. I’m a bit confused about what I’m supposed to learn from these two snippets.

To be more explicit here, this is what the 2nd code snippet looks like:

s = 'Tokenization is hard.'
sn = normalize('NFKC', 'Tokenization is hard.')
sn_ = s.replace(' ', '\u2581')

But I’d expected instead to see code where the 2nd and 3rd operations are related, i.e. this:

s = 'Tokenization is hard.'
sn = normalize('NFKC', s)
sn_ = sn.replace(' ', '\u2581')
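As a rough sketch of why the order could matter once the two operations are actually chained: the helper names and the non-breaking-space input below are my own illustration, not taken from the lab. For the plain ASCII sentence both orders give the same result, but a character that NFKC folds into an ordinary space (such as U+00A0) only becomes '\u2581' if normalization runs first.

```python
from unicodedata import normalize

def normalize_then_replace(text):
    # Hypothetical helper: NFKC-normalize first, then mark spaces with U+2581.
    return normalize('NFKC', text).replace(' ', '\u2581')

def replace_then_normalize(text):
    # Hypothetical helper: mark spaces first, then NFKC-normalize.
    return normalize('NFKC', text.replace(' ', '\u2581'))

# Plain ASCII input: NFKC changes nothing, so the order is irrelevant.
s = 'Tokenization is hard.'
same_for_ascii = normalize_then_replace(s) == replace_then_normalize(s)

# Assumed example input with a non-breaking space (U+00A0) after the first word.
s_nbsp = 'Tokenization\u00a0is hard.'
a = normalize_then_replace(s_nbsp)   # NBSP folded to ' ' first, then replaced
b = replace_then_normalize(s_nbsp)   # NBSP missed by replace(), folded to ' ' afterwards

print(same_for_ascii)
print(repr(a))
print(repr(b))
```

With chaining, the first order marks every space, while the second leaves an unmarked plain space where the non-breaking space used to be.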

Let me know what you think!

jyadav202 · December 22, 2023, 12:15pm

Hi, the code does look confusing and does not serve its purpose well. We are looking at improving this section.
Meanwhile, you can read about lossless tokenization in the following resources: blog post, paper
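For readers who just want the gist of "lossless" here, a minimal sketch of the whitespace-escaping idea follows. It is not SentencePiece's actual implementation: the function names and the hand-written piece split are hypothetical, and it assumes the input does not already contain U+2581. Because spaces are kept as an explicit symbol, joining the pieces and mapping the symbol back to a space recovers the original string.

```python
WS = '\u2581'  # the meta symbol used to mark whitespace

def encode_whitespace(text):
    # Hypothetical helper: make every space an explicit, visible symbol.
    return text.replace(' ', WS)

def decode_whitespace(text):
    # Hypothetical helper: map the meta symbol back to a plain space.
    return text.replace(WS, ' ')

s = 'Tokenization is hard.'
escaped = encode_whitespace(s)

# A hand-written split into pieces, for illustration only (not a trained model's output).
pieces = ['Token', 'ization', WS + 'is', WS + 'hard', '.']

# Detokenization is just concatenation plus undoing the escape.
restored = decode_whitespace(''.join(pieces))
print(restored == s)
```

Since no information about spacing is thrown away during the split, no language-specific rules are needed to put the sentence back together.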
