Too many words in Vocabulary for Tokenizer: TF Course 3 W1 Assignment

Dear Instructors, I was able to read in the datasets, get the right number of sentences, and use remove_stopwords to get the expected outputs. However, the word_index vocabulary contains 43686 words instead of the expected 29714 words. It looks like the Tokenizer filter is not working as expected, as there are many items containing $, £, and hyphens when I read in parse_data_from_file("./data/bbc-text.csv").

What is the best way to resolve this error? An excerpt from the word_index dictionary is below:

'saturday:': 43597, 'heaving': 43598, '(£230bn)': 43599, '$2.58': 43600, '(£1.38': 43601, '$419.3bn.': 43602, 'beale': 43603, '$5bn.': 43604, '$4.5': 43605, '20-year': 43606, 'stockpile.': 43607, '$2.18': 43608, '$2.57': 43609, 'pro-growth': 43610, 'leno': 43611, 'blige': 43612, 'uma': 43613, 'thurman': 43614, 'superstars': 43615, 'co-present': 43616, 'letter:': 43617, 'commercial-free': 43618, '(£80m).': 43619, 'ditched': 43620, 'kingsholm.': 43621, 'semi-professional': 43622, 'rfc.': 43623, 'francais': 43624, 'beijingers': 43625, 'fume': 43626, 'choking': 43627, 'jams': 43628, 'reorganising': 43629, 'roads.': 43630, 'clogged': 43631, 'circling': 43632, 'yan': 43633, 'bays': 43634, '250%': 43635, 'rage.': 43636, 'jacking': 43637, 'hourly': 43638, '($48;': 43639, '£26).': 43640, 'motorcades': 43641, 'outriders': 43642, 'unclogging': 43643, 'impassable': 43644, 'expecting.': 43645, '0.9%.': 43646, 'grocery': 43647, '1.1%.': 43648, 'parul': 43649, 'landlords': 43650, '£143': 43651, 'seeker.': 43652, 'tent': 43653, 'flu.': 43654, 'balloch': 43655, 'lomond': 43656, 'kick-start': 43657, 'hampden': 43658, 'bellahouston': 43659, '9-10': 43660, 'ticketweb': 43661, 'refund.': 43662, 'snowball': 43663, 'supporters)': 43664, 'bickering.': 43665, 'die.)': 43666, 'rationally': 43667, 'post-neo-classical': 43668, 'wonks': 43669, 'courses…': 43670, 'prisoners.': 43671, 'heinous': 43672, 'cells.': 43673, 'convict': 43674, 'mouths': 43675, 'induce': 43676, 'confess.': 43677, '283': 43678, 'selfishly.': 43679, 'ensues': 43680, 'perverse': 43681, 'exhorting': 43682, 'solomon': 43683, 'participants.': 43684, 'allocating': 43685, 'heerenveen.': 43686}

Vocabulary contains 43686 words
<OOV> token included in vocabulary

Expected Output:

After invoking fit_tokenizer(sentences), max(word_index.values()) should be 29714. If it is not, please revisit your code.

One thing to look for is that symbols like $ should not be present in the word index. This is because the Tokenizer removes special characters before processing the texts. See the filters parameter of Tokenizer.
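For reference, here is a minimal sketch of what fit_tokenizer is expected to do (not the official solution, and the exact signature in the notebook may differ). The default filters argument of the Keras Tokenizer already strips punctuation such as $, (, ), . and :, so there is no need to override it:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

def fit_tokenizer(sentences):
    # Instantiate the Tokenizer with an out-of-vocabulary token
    tokenizer = Tokenizer(oov_token="<OOV>")

    # fit_on_texts expects a list of strings. If each element is a list of
    # words instead, Keras treats it as already tokenized and skips the
    # filtering/lowercasing step, which leaves $ and punctuation in word_index.
    tokenizer.fit_on_texts(sentences)

    return tokenizer
```

If word_index still contains entries with $ or trailing periods, it usually means fit_on_texts received lists of words rather than plain strings.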

Thanks Balaji. I am using a Python 3 kernel, and it looks like my call to Tokenizer is not filtering as expected. Here are my call details. Is there something I can do to enable the filtering, since the missing filtering is adding many other expressions? I will message you with more details. - Priya

Please click my name and message your notebook as an attachment.
Don’t forget to remove code from your post.

The implementation of remove_stopwords is incorrect. You should return a string, not a list of strings.
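For illustration only, a minimal sketch of the intended behaviour, assuming a predefined stopword list (the stopwords variable here is hypothetical; the notebook defines its own list):

```python
def remove_stopwords(sentence):
    # Hypothetical stopword list; the notebook provides the full one
    stopwords = ["a", "an", "the", "and", "in", "of", "to"]

    # Lowercase, split into words, and drop stopwords
    words = [word for word in sentence.lower().split() if word not in stopwords]

    # Join back into a single string so the Tokenizer can later apply its filters
    return " ".join(words)
```

Returning the joined string is what allows fit_on_texts to apply its filters later on, which removes the $ and punctuation tokens you were seeing.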

Please don’t change cells where you’re not asked to code. For instance, you’ve changed this content:

In starter code:

# Test your function

# With original dataset
sentences, labels = parse_data_from_file("./data/bbc-text.csv")

print("ORIGINAL DATASET:\n")
print(f"There are {len(sentences)} sentences in the dataset.\n")
print(f"First sentence has {len(sentences[0].split())} words (after removing stopwords).\n")
print(f"There are {len(labels)} labels in the dataset.\n")
print(f"The first 5 labels are {labels[:5]}\n\n")

# With a miniature version of the dataset that contains only first 5 rows
mini_sentences, mini_labels = parse_data_from_file("./data/bbc-text-minimal.csv")

print("MINIATURE DATASET:\n")
print(f"There are {len(mini_sentences)} sentences in the miniature dataset.\n")
print(f"First sentence has {len(mini_sentences[0].split())} words (after removing stopwords).\n")
print(f"There are {len(mini_labels)} labels in the miniature dataset.\n")
print(f"The first 5 labels are {mini_labels[:5]}")

Your code:

# Test your function

# With original dataset
sentences, labels = parse_data_from_file("./data/bbc-text.csv")

print("ORIGINAL DATASET:\n")
print(f"There are {len(sentences)} sentences in the dataset.\n")
print(f"First sentence has {len(sentences[0])} words (after removing stopwords).\n")
print(f"There are {len(labels)} labels in the dataset.\n")
print(f"The first 5 labels are {labels[:5]}\n\n")

# With a miniature version of the dataset that contains only first 5 rows
mini_sentences, mini_labels = parse_data_from_file("./data/bbc-text-minimal.csv")

print("MINIATURE DATASET:\n")
print(f"There are {len(mini_sentences)} sentences in the miniature dataset.\n")
print(f"First sentence has {len(mini_sentences[0])} words (after removing stopwords).\n")
print(f"There are {len(mini_labels)} labels in the miniature dataset.\n")
print(f"The first 5 labels are {mini_labels[:5]}")

Hi Balaji,

Thanks for pinpointing the issue. I will keep in mind to modify code only in the suggested blocks, as I can see how changing other cells makes both finishing the homework and your troubleshooting harder. :slight_smile: Thanks so much!

Best regards, Priya