Too many words in Vocabulary for Tokenizer: TF Course 3 W1 Assignment

Dear Instructors, I was able to read in the datasets, get the right number of sentences, and use remove_stopwords to get the expected outputs. However, the word_index vocabulary contains 43686 words instead of the expected 29714 words. It looks like the Tokenizer filter is not working as expected, as there are many items containing $, £, and hyphens when I read in parse_data_from_file("./data/bbc-text.csv").

What is the best way to resolve this error? An excerpt from the word_index dictionary is below:

'saturday:': 43597, 'heaving': 43598, '(£230bn)': 43599, '$2.58': 43600, '(£1.38': 43601, '$419.3bn.': 43602, 'beale': 43603, '$5bn.': 43604, '$4.5': 43605, '20-year': 43606, 'stockpile.': 43607, '$2.18': 43608, '$2.57': 43609, 'pro-growth': 43610, 'leno': 43611, 'blige': 43612, 'uma': 43613, 'thurman': 43614, 'superstars': 43615, 'co-present': 43616, 'letter:': 43617, 'commercial-free': 43618, '(£80m).': 43619, 'ditched': 43620, 'kingsholm.': 43621, 'semi-professional': 43622, 'rfc.': 43623, 'francais': 43624, 'beijingers': 43625, 'fume': 43626, 'choking': 43627, 'jams': 43628, 'reorganising': 43629, 'roads.': 43630, 'clogged': 43631, 'circling': 43632, 'yan': 43633, 'bays': 43634, '250%': 43635, 'rage.': 43636, 'jacking': 43637, 'hourly': 43638, '($48;': 43639, '£26).': 43640, 'motorcades': 43641, 'outriders': 43642, 'unclogging': 43643, 'impassable': 43644, 'expecting.': 43645, '0.9%.': 43646, 'grocery': 43647, '1.1%.': 43648, 'parul': 43649, 'landlords': 43650, '£143': 43651, 'seeker.': 43652, 'tent': 43653, 'flu.': 43654, 'balloch': 43655, 'lomond': 43656, 'kick-start': 43657, 'hampden': 43658, 'bellahouston': 43659, '9-10': 43660, 'ticketweb': 43661, 'refund.': 43662, 'snowball': 43663, 'supporters)': 43664, 'bickering.': 43665, 'die.)': 43666, 'rationally': 43667, 'post-neo-classical': 43668, 'wonks': 43669, 'courses…': 43670, 'prisoners.': 43671, 'heinous': 43672, 'cells.': 43673, 'convict': 43674, 'mouths': 43675, 'induce': 43676, 'confess.': 43677, '283': 43678, 'selfishly.': 43679, 'ensues': 43680, 'perverse': 43681, 'exhorting': 43682, 'solomon': 43683, 'participants.': 43684, 'allocating': 43685, 'heerenveen.': 43686}

Vocabulary contains 43686 words
<OOV> token included in vocabulary

Expected Output:

After invoking fit_tokenizer(sentences), max(word_index.values()) should be 29714. If it is not, please revisit your code.

One thing to look for is that symbols like $ should not be present in the word index. This is because the Tokenizer removes special characters before processing the texts. See the filters parameter of Tokenizer.
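For reference, here is a minimal sketch of what fit_tokenizer is expected to do (not the official solution, and the exact signature in the notebook may differ). The default filters argument of the Keras Tokenizer already strips punctuation such as $, (, ), . and :, so there is no need to override it:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

def fit_tokenizer(sentences):
    # Instantiate the Tokenizer with an out-of-vocabulary token
    tokenizer = Tokenizer(oov_token="<OOV>")

    # fit_on_texts expects a list of strings. If each element is a list of
    # words instead, Keras treats it as already tokenized and skips the
    # filtering/lowercasing step, which leaves $ and punctuation in word_index.
    tokenizer.fit_on_texts(sentences)

    return tokenizer
```

If word_index still contains entries with $ or trailing periods, it usually means fit_on_texts received lists of words rather than plain strings.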

Thanks Balaji. I am using a Python 3 kernel, and it looks like my call to Tokenizer is not filtering as expected. Here are my call details. Is there something I can do to enable the filtering, since the missing filtering is adding many other expressions? I will message you with more details. - Priya

Please click my name and message your notebook as an attachment.
Don’t forget to remove code from your post.

The implementation of remove_stopwords is incorrect. You should return a string, not a list of strings.
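For illustration only, a minimal sketch of the intended behaviour, assuming a predefined stopword list (the stopwords variable here is hypothetical; the notebook defines its own list):

```python
def remove_stopwords(sentence):
    # Hypothetical stopword list; the notebook provides the full one
    stopwords = ["a", "an", "the", "and", "in", "of", "to"]

    # Lowercase, split into words, and drop stopwords
    words = [word for word in sentence.lower().split() if word not in stopwords]

    # Join back into a single string so the Tokenizer can later apply its filters
    return " ".join(words)
```

Returning the joined string is what allows fit_on_texts to apply its filters later on, which removes the $ and punctuation tokens you were seeing.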

Please don’t change cells where you’re not asked to code. For instance, you’ve changed this content:

In starter code:

# Test your function

# With original dataset
sentences, labels = parse_data_from_file("./data/bbc-text.csv")

print("ORIGINAL DATASET:\n")
print(f"There are {len(sentences)} sentences in the dataset.\n")
print(f"First sentence has {len(sentences[0].split())} words (after removing stopwords).\n")
print(f"There are {len(labels)} labels in the dataset.\n")
print(f"The first 5 labels are {labels[:5]}\n\n")

# With a miniature version of the dataset that contains only first 5 rows
mini_sentences, mini_labels = parse_data_from_file("./data/bbc-text-minimal.csv")

print("MINIATURE DATASET:\n")
print(f"There are {len(mini_sentences)} sentences in the miniature dataset.\n")
print(f"First sentence has {len(mini_sentences[0].split())} words (after removing stopwords).\n")
print(f"There are {len(mini_labels)} labels in the miniature dataset.\n")
print(f"The first 5 labels are {mini_labels[:5]}")

Your code:

# Test your function

# With original dataset
sentences, labels = parse_data_from_file("./data/bbc-text.csv")

print("ORIGINAL DATASET:\n")
print(f"There are {len(sentences)} sentences in the dataset.\n")
print(f"First sentence has {len(sentences[0])} words (after removing stopwords).\n")
print(f"There are {len(labels)} labels in the dataset.\n")
print(f"The first 5 labels are {labels[:5]}\n\n")

# With a miniature version of the dataset that contains only first 5 rows
mini_sentences, mini_labels = parse_data_from_file("./data/bbc-text-minimal.csv")

print("MINIATURE DATASET:\n")
print(f"There are {len(mini_sentences)} sentences in the miniature dataset.\n")
print(f"First sentence has {len(mini_sentences[0])} words (after removing stopwords).\n")
print(f"There are {len(mini_labels)} labels in the miniature dataset.\n")
print(f"The first 5 labels are {mini_labels[:5]}")

Hi Balaji,

Thanks for pinpointing the issue. I will keep in mind to modify code only in the suggested blocks, as I can see how changing other cells makes both finishing the homework and your troubleshooting harder. :slight_smile: Thanks so much!

Best regards, Priya