I am curious about the conversion of texts to numerical vectors using tf-idf. My first question is how to keep the dimensionality of the vectors the same, given that some texts are longer than others.
So, how are the tf-idf tables/vectors built? Does the approach consider the unique words in the whole training-set corpus? Or, for each document, are only the unique words of that document used to build its vector (which could create vectors of different lengths)?
If, instead, the approach uses all unique words in the training corpus to build the word tables from which tf-idf values are calculated for each document, won't this be problematic later given that the vector dimensions would be huge? I mean, 1000 documents of diverse lengths in the training corpus may contain thousands of unique words, and this may lead to the curse of dimensionality later on.
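To make my question concrete, here is a minimal pure-Python sketch of what I understand the corpus-wide approach to be (the smoothed idf variant here is just one common choice; I'm not sure it matches what standard libraries do):

```python
import math
from collections import Counter

# Toy corpus: documents of different lengths.
docs = [
    "the cat sat on the mat",
    "the dog barked",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

# Vocabulary built from ALL unique words in the corpus,
# so every document vector ends up with the same length.
vocab = sorted({w for doc in tokenized for w in doc})

n_docs = len(tokenized)
# Document frequency: number of documents containing each word.
df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}

def tfidf_vector(doc):
    counts = Counter(doc)
    vec = []
    for w in vocab:
        tf = counts[w] / len(doc)           # term frequency in this document
        idf = math.log(n_docs / df[w]) + 1  # one common smoothed idf variant
        vec.append(tf * idf)
    return vec

vectors = [tfidf_vector(doc) for doc in tokenized]
# Every vector has len(vocab) dimensions, regardless of document length.
print([len(v) for v in vectors], len(vocab))
```

Under this reading, the vector length equals the corpus vocabulary size, which is exactly why I worry it explodes for large corpora.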