PDF classification

Hello everyone, I built a model to classify PDFs. The problem is that my training dataset has almost 20,000 features, but when I apply the same preprocessing steps to my test cases, I get only 3,000, which makes sense. My question is, how can I fix that so the test cases have the same shape? The training data is an n*m matrix. I am thinking about reshaping the test case matrix to 1*m and then filling the missing values with 0, but that doesn’t seem logical.

Hi there,
If preprocessing is done on both the training and testing sets, the number of features in the two sets should not differ.
Kind Regards
Ali Namin


No, that doesn’t make sense. Preprocessing should handle the number of features identically for both the training and test sets.

Hello @BALK, would you mind sharing which method you used to classify them?

Hello, I think there is a misunderstanding. The training and testing sets you are talking about are split at the same time, so they have the same number of features. When I said “test,” I meant that in production we will only try to classify a single PDF, for example, and it will contain around 3,000 words at most. When I apply the same preprocessing, I get only 3,000 features, but my trained model requires 20,000.

Hello there, I used the Naive Bayes algorithm.

If you’re pre-processing the training set before you use it to create a model, you will need to apply the same pre-processing to the examples you want to use in production.
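
To make that concrete, here is a minimal sketch, assuming the features are bag-of-words counts (the thread doesn't say exactly which preprocessing was used). The idea is to build the vocabulary once from the training documents and reuse it unchanged for every new document, so each row always has one column per training word. The helper names (`tokenize`, `fit_vocabulary`, `transform`) and the sample texts are hypothetical, chosen only for illustration. Libraries such as scikit-learn follow the same pattern: fit the vectorizer on the training data, then call only `transform` in production.

```python
from collections import Counter

def tokenize(text):
    # Hypothetical minimal tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def fit_vocabulary(train_docs):
    # Build the word -> column index mapping from the TRAINING documents only.
    vocab = {}
    for doc in train_docs:
        for word in tokenize(doc):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def transform(doc, vocab):
    # Map a document onto the fixed training vocabulary.
    # Words not seen during training are ignored; training words
    # absent from this document keep a count of 0.
    counts = Counter(tokenize(doc))
    row = [0] * len(vocab)
    for word, n in counts.items():
        idx = vocab.get(word)
        if idx is not None:
            row[idx] = n
    return row

# Made-up training texts standing in for the extracted PDF text.
train_docs = [
    "invoice total amount due",
    "research paper abstract method",
    "invoice payment due date",
]
vocab = fit_vocabulary(train_docs)
X_train = [transform(d, vocab) for d in train_docs]

# In production: a single new document, vectorized with the SAME vocabulary.
x_new = transform("invoice amount overdue", vocab)
print(len(vocab), len(x_new))
```

With this approach the zeros are not padding added after the fact: each column still means "count of this particular training word", which is what the trained model expects. Re-building the vocabulary from the single production document is what produces the 3,000-column mismatch.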

@BALK IMHO, one fairly obvious question isn't answered here: what criteria are you trying to classify the PDF documents on? (i.e., theme, language, topic, sentiment, etc.)