Preprocessing | More Information

Hi,

Below are the preprocessing steps we have to perform:

  • Eliminate handles and URLs.
  • Tokenize the string into words.
  • Remove stop words such as “and”, “is”, “a”, and “on”.
  • Stemming: convert every word to its stem. For example, dancing and danced both become ‘danc’. You can use the Porter stemmer to take care of this.
  • Convert all your words to lower case.

My question is: as a best practice, is there a recommended order in which to apply the steps above?

Thanks

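I recommend lowercasing before stop-word removal and the steps that follow it, since case matters: stop-word lists are typically lowercase, so “The” would not match “the” unless the text has been lowercased first.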
See this link for a good example on text preprocessing.
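
To make the ordering concrete, here is a minimal sketch of one possible pipeline in plain Python. It is not taken from the course material: the names `preprocess`, `naive_stem`, and `STOP_WORDS` are made up for illustration, the stop-word list is deliberately tiny, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"and", "is", "a", "on", "the", "to", "at", "of", "in"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def naive_stem(word):
    """Crude suffix stripper; a stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=naive_stem):
    text = URL_RE.sub(" ", text)     # 1. drop URLs while they are still intact
    text = HANDLE_RE.sub(" ", text)  #    and handles, before tokenizing splits them
    text = text.lower()              # 2. lowercase so "The" matches "the"
    tokens = TOKEN_RE.findall(text)  # 3. tokenize into words
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 4. remove stop words
    return [stem(t) for t in tokens]  # 5. stem last


print(preprocess("The dancers danced AND @bob is dancing on http://example.com/x"))
```

Stemming is placed last here because stop-word lists usually contain unstemmed words, and a stemmer may alter a stop word so that it no longer matches the list. If you have a real stemmer available, you can pass its stemming function as the `stem` argument instead of the stand-in.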
