PDF classification

BALK · June 15, 2024, 9:29am

Hello everyone, I built a model to classify PDFs. The problem is that my training data has almost 20,000 features, but when I apply the same preprocessing steps to my test cases, I get only 3,000, which makes sense. My question is: how can I fix this so the test cases have the same shape? The training data is an n*m matrix. I am thinking about reshaping the test case matrix to 1*m and filling the missing values with 0, but that doesn’t seem logical.

alinamin · June 15, 2024, 2:51pm

Hi there,
If preprocessing is done on both the training and testing sets, the features of the test and training sets should not differ.
Kind Regards
Ali Namin

TMosh · June 15, 2024, 2:56pm

No, that doesn’t make sense. Preprocessing should handle the number of features identically for both the training and test sets.

Mourad11 · June 16, 2024, 3:10pm

Hello @BALK, would you mind sharing which method you used to classify them?

BALK · June 16, 2024, 8:29pm

Hello, I think there is a misunderstanding. The training and testing sets you are talking about are split from the same data at the same time, so they have the same number of features. When I said “test,” I meant that in production we will classify only a single PDF, for example, and it will contain at most around 3,000 words. When I apply the same preprocessing to it, I get only 3,000 features, but my trained model requires 20,000.

BALK · June 16, 2024, 8:30pm

Hello there, I used the Naive Bayes algorithm.

TMosh · June 16, 2024, 9:35pm

If you’re pre-processing the training set before you use it to create a model, you will need to apply the same pre-processing to the examples you want to use in production.
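
To make the idea concrete, here is a minimal sketch in plain Python, assuming the features are bag-of-words counts (the thread does not say which features are actually used). The helper names `tokenize`, `build_vocabulary`, and `vectorize` and the sample documents are hypothetical. The point is that the vocabulary is learned once from the training set and then reused unchanged, so a single new document always maps to a vector with the same number of columns as the training matrix. Words missing from the document get a count of 0, and words never seen in training are ignored.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for the real preprocessing."""
    return text.lower().split()


def build_vocabulary(train_docs):
    """Map each distinct training token to a fixed column index."""
    tokens = sorted({tok for doc in train_docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(doc, vocabulary):
    """Return a count vector of length len(vocabulary) for one document."""
    counts = Counter(tokenize(doc))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        idx = vocabulary.get(tok)
        if idx is not None:  # words never seen in training are ignored
            vector[idx] = n
    return vector


# Hypothetical training documents (text already extracted from the PDFs).
train_docs = [
    "invoice total amount due",
    "contract terms and parties",
    "invoice payment due date",
]

# Learn the vocabulary once, at training time, and save it with the model.
vocabulary = build_vocabulary(train_docs)
X_train = [vectorize(doc, vocabulary) for doc in train_docs]

# In production, reuse the saved vocabulary; do not rebuild it from the new PDF.
new_doc = "payment due for this invoice"
x_new = vectorize(new_doc, vocabulary)

print(len(vocabulary), len(x_new))
```

If the features come from scikit-learn's `CountVectorizer` or a similar class, the equivalent is to call `fit_transform` on the training documents only and plain `transform` on the new document, using the same fitted vectorizer object.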

Nevermnd · June 17, 2024, 12:44am

@BALK, IMHO one fairly obvious question has not been answered here: on what criteria are you trying to classify the PDF documents (e.g. theme, language, topic, sentiment)?
