Tf-idf approach questions

I am curious about how texts are converted to numerical vectors using tf-idf. My first question is how the dimensionality of the vectors is kept the same, given that some texts are longer than others.

So, how are the tf-idf tables/vectors built? Does the approach consider the unique words in the whole training corpus? Or are the unique words of each document used to build its vector (which would create vectors of different lengths)?

If, instead, the approach uses all unique words in the training corpus to build the word tables from which tf-idf values are calculated for each document, won't this be problematic later, given that the vector dimensions would be huge? I mean, 1000 documents of diverse lengths in the training corpus may contain thousands of unique words, and this may lead to the curse of dimensionality later on.

Using all unique words will push you into an insane number of dimensions very quickly, and your matrices will be sparse. Both are problems for ML, so people avoid that.
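To make the first part concrete, here is a minimal sketch using scikit-learn's `TfidfVectorizer` (the question names no library, so this choice is an assumption). The vocabulary is built from all unique words in the training corpus, so every document gets a vector of the same length, and the resulting matrix is stored sparsely because each document uses only a small fraction of the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: documents of different lengths
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a very long document about dogs and cats and mats",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix, shape (n_docs, n_vocab)

# Every document maps to a vector of the same length:
# one column per unique word in the WHOLE corpus.
print(X.shape)
print(len(vectorizer.vocabulary_))

# Most entries are zero, since each document contains
# only a few of the corpus's unique words.
print(X.nnz, "non-zero entries out of", X.shape[0] * X.shape[1])
```

Note that `X.shape[1]` equals the corpus-wide vocabulary size, not any single document's word count, which answers the original question about keeping dimensionality fixed.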

You can always apply some constraints:

  • use only 2-grams (bigrams) instead of 3-grams (trigrams),
  • reject words that appear WAY too often,
  • reject words that appear VERY rarely,
  • if you know that some parts of the text are meaningless (for example names), you can run your documents through a tagger and remove those parts.
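The first three constraints map directly onto `TfidfVectorizer` parameters in scikit-learn (a sketch, assuming that library; the exact thresholds below are illustrative, not recommendations):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),  # unigrams and bigrams only, no trigrams
    max_df=0.9,          # reject terms appearing in more than 90% of documents
    min_df=2,            # reject terms appearing in fewer than 2 documents
    max_features=5000,   # optional hard cap on vocabulary size
)

corpus = [
    "cats chase mice",
    "dogs chase cats",
    "mice fear cats and dogs",
    "birds fly high above the cats",
]
X = vectorizer.fit_transform(corpus)

# "cats" is gone (too frequent, df > max_df) and one-off words
# like "fear" are gone (too rare, df < min_df).
print(sorted(vectorizer.vocabulary_))
```

The tagger-based filtering from the last bullet would happen as a preprocessing step before vectorization (e.g. with a part-of-speech or named-entity tagger), since `TfidfVectorizer` itself only sees raw strings.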

For the details of how tf-idf works:
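The core computation can also be written out by hand. This sketch uses the classic textbook definition (tf = raw term count, idf = log(N / df)); real libraries typically apply smoothed variants, so their exact numbers will differ:

```python
import math
from collections import Counter

# Toy corpus, already tokenized
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
N = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(word for doc in docs for word in set(doc))

def tfidf(term, doc):
    tf = doc.count(term)          # raw term frequency in this document
    idf = math.log(N / df[term])  # rarer terms get a higher weight
    return tf * idf

# "the" appears in every document, so idf = log(3/3) = 0
# and its tf-idf weight vanishes:
print(tfidf("the", docs[0]))   # 0.0

# "cat" appears in 2 of 3 documents, so it carries some weight:
print(round(tfidf("cat", docs[0]), 3))
```

This also shows why the overly frequent words from the constraint list above are nearly useless: their idf drives the weight toward zero anyway.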