Why are we filtering out words that occurred once?

In the C2_W2_lecture_nb_1_strings_tags lab, there is a particular line that says

# Create the vocabulary by filtering the 'freq' dictionary
vocab = [k for k, v in freq.items() if (v > 1 and k != '\n')]

Why use v > 1?

Hi @Sharthak_Ghosh.
In this example we use v > 1 to remove words that occurred only once, as you mentioned.
There are a few reasons to do that, for example:

  • Noise Reduction: Words that occur only once in a dataset are often typographical errors, misspellings, or proper nouns that don’t contribute much to the overall understanding of the text. Removing them helps reduce noise in the vocabulary.

  • Reducing Overfitting: Rare words that occur only once might lead to overfitting, especially when dealing with limited training data, so removing them reduces that risk.

  • Focus on Common Vocabulary: The reasoning is that words which occur repeatedly are more likely to be meaningful and informative than words seen only once. By focusing on these words you are able to capture the core vocabulary of the text.
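
As a minimal sketch of what that filter does, here is a toy example. The word list is made up for illustration, and `freq` is assumed to be a plain word-count dictionary like the one in the lab:

```python
from collections import Counter

# Hypothetical toy corpus; "teh" stands in for a one-off typo.
words = ["the", "cat", "sat", "on", "the", "mat", "\n",
         "the", "cat", "ate", "teh", "fish", "\n"]

# Count how many times each token occurs.
freq = Counter(words)

# Keep only words seen more than once, and drop the newline token.
vocab = sorted(k for k, v in freq.items() if v > 1 and k != "\n")

# Words that were filtered out because they occurred only once.
dropped = sorted(k for k, v in freq.items() if v == 1)

print(vocab)
print(dropped)
```

Here only "the" and "cat" survive, and the typo "teh" is dropped together with the other one-off words.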

I hope it helped you a bit

Thanks a lot, makes sense now, but could you expand on the second point that is overfitting? Could you please explain the impact on the math that would cause it to overfit? Wouldn’t a large number of single words cause this problem?

Hey @Sharthak_Ghosh

As I’m not a Mentor for this specialization, I will do my best to answer you, and other NLP Mentors can add notes if I have forgotten anything.

Here’s how the model might overfit because of rare words:

  1. Low-Frequency Words: As I mentioned, words that appear only once or a few times in the entire dataset might not carry much meaningful information. Including such words in the vocabulary can lead to the model learning specific instances and noise present in the training data.

  2. Memorization: If there are many rare words, the model may start to associate particular rare words with specific training examples and end up memorizing those examples. This memorization does not generalize to new data, and as you know we need the model to generalize what it learns rather than memorize it.

  3. Reduced Generalization: Following on from the previous point, the model’s ability to generalize is compromised when it’s too focused on rare words. It can fail to recognize higher-level patterns and relationships that are important for accurate predictions on unseen data.
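
To make the math a bit more concrete, here is a rough sketch that assumes a simple count-based tagger estimating P(tag | word) from relative frequencies. The tiny tagged corpus and the helper `tag_prob` are hypothetical and not taken from the lab:

```python
from collections import Counter, defaultdict

# Hypothetical (word, tag) training pairs; "zorble" appears only once.
tagged = [
    ("the", "DT"), ("run", "VB"), ("the", "DT"), ("run", "NN"),
    ("run", "VB"), ("the", "DT"), ("zorble", "NN"),
]

# counts[word][tag] = number of times `word` was seen with `tag`.
counts = defaultdict(Counter)
for word, tag in tagged:
    counts[word][tag] += 1

def tag_prob(word, tag):
    """Relative-frequency estimate of P(tag | word) from the training counts."""
    total = sum(counts[word].values())
    return counts[word][tag] / total if total else 0.0

p_run_vb = tag_prob("run", "VB")        # based on three observations
p_zorble_nn = tag_prob("zorble", "NN")  # based on a single observation
print(p_run_vb)
print(p_zorble_nn)
```

With a single occurrence, the estimate for "zorble" is 1.0 for one tag and 0 for every other tag. That is the model memorizing one training example, while the estimate for "run" is spread over the tags it was actually seen with.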

And yes, a large number of single-occurrence words can cause overfitting as well.
Now maybe you are wondering: if you already know that you have only a small number of such words, can you leave them in?

The answer: The decision of whether to leave single-occurrence words in your dataset's vocabulary or to filter them out depends on your specific goals, the size and nature of your dataset, and the characteristics of the problem you're trying to solve.
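
One way to inform that decision is to check how the vocabulary size changes with the cutoff. A small sketch, where `build_vocab` and `min_count` are names made up for illustration (min_count=2 corresponds to the v > 1 filter):

```python
from collections import Counter

def build_vocab(words, min_count=2):
    """Return the sorted words that occur at least `min_count` times."""
    freq = Counter(words)
    return sorted(k for k, v in freq.items() if v >= min_count and k != "\n")

# Hypothetical toy corpus: "a" once, "b" twice, "c" three times, "d" four times.
words = ["a"] + ["b"] * 2 + ["c"] * 3 + ["d"] * 4 + ["\n"] * 5

# Vocabulary size for a few candidate cutoffs.
sizes = {m: len(build_vocab(words, m)) for m in (1, 2, 3)}
print(sizes)
```

If raising the cutoff from 1 to 2 removes a large share of the vocabulary on your real data, it is worth looking at what those words are before deciding.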

Filtering out words that occur only once is just generally considered good practice.
I hope you got it now


Thank you, will try to read up more about this.
