Help with pdf data

AbishekSathiyendran · September 9, 2024, 4:58pm

I'm currently working on a project that extracts information from a PDF and matches it against target labels set by the user. I need to parse all sorts of PDF formats, and I'm having trouble gathering data for this purpose. For example, once I get the PDFs, how do I label them?

Any and all help is appreciated. Thanks in advance.

Nevermnd · September 9, 2024, 5:00pm

@AbishekSathiyendran sorry, but up front I have to ask what you mean by ‘all sorts of PDF formats’?

Isn’t PDF, well, PDF?

Or are you speaking of the layout?

AbishekSathiyendran · September 9, 2024, 5:16pm

Well, for example, if we take a CV, there are many different CV layouts, and you have to parse through all of them to get the applicant details. Something like that.

Nevermnd · September 9, 2024, 5:30pm

Well, assuming you have the digital, not print copy of the document, I don’t think you need OCR.

I am going to ‘show my age’ a little, but I think ImageMagick could parse it for you, though that is in the realm of PHP. Perhaps they’ve made a Python port?

AbishekSathiyendran · September 10, 2024, 3:21am

Maybe, but my main issue is what happens after I parse it: how do I get the exact structure of information I want? For example, sticking with the CV problem, people’s work experience comes in different lengths and formats. How do I set things up to extract that experience data properly?

Nevermnd · September 10, 2024, 3:27am

@AbishekSathiyendran if you decide to do DLS, and then NLP, I think this will give you some ideas how you might approach this problem.

I was kind of thinking, first, of something more basic. I mean, you have to get the text out of the document, right? How might you achieve that?

Yes, today there are all these ‘multimodal’ solutions, but I say go simple first and start from there.
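
As a rough illustration of the "go simple first" step, here is a minimal sketch of pulling the text layer out of a digital (non-scanned) PDF. It assumes the third-party pypdf package, which is a stand-in choice and not the library recommended later in this thread; the function names and the file path in the usage note are made up for the example.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse runs of spaces/tabs and drop the blank lines extraction leaves behind."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def extract_pdf_text(path: str) -> str:
    """Return the cleaned text layer of a PDF, pages joined in order."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported lazily
    # so that clean_text() can be used on its own without the dependency.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # A page without a text layer (e.g. a scan) may yield an empty string.
    pages = [page.extract_text() or "" for page in reader.pages]
    return clean_text("\n".join(pages))
```

Calling `extract_pdf_text("cv.pdf")` (a hypothetical file) would return one string for the whole document. If the result is empty, the PDF is probably a scan, and that is the case where OCR would come back into the picture.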

Rorisang · September 12, 2024, 6:11am

Use this library for extracting data from PDFs: PDF Parser (see the PDF Parser documentation).

Try this tutorial: How to Extract Data from PDF Files with Python.

As mentioned by @Nevermnd, PDFs are a single format. Take the simpler solution and work with that.
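
For the earlier question about getting work experience out of CVs with different layouts, one simple first pass is to look for section headings in the extracted text and group the lines under them. The sketch below is only a heuristic under stated assumptions: the alias lists and target labels in `SECTION_ALIASES` are hypothetical, and real CVs will need more aliases or a learned model once this stops being enough.

```python
import re

# Hypothetical mapping from a user's target labels to heading variants seen in CVs.
SECTION_ALIASES = {
    "experience": ["work experience", "professional experience",
                   "employment history", "experience"],
    "education": ["education", "academic background"],
    "skills": ["skills", "technical skills"],
}


def match_heading(line: str):
    """Return the target label if the line is a known section heading, else None."""
    # Lowercase and strip punctuation/digits so "WORK EXPERIENCE:" still matches.
    normalized = re.sub(r"[^a-z ]", "", line.lower()).strip()
    for label, aliases in SECTION_ALIASES.items():
        if normalized in aliases:
            return label
    return None


def split_sections(text: str) -> dict:
    """Group non-empty lines under the most recently seen recognised heading."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        label = match_heading(line)
        if label is not None:
            current = label
            sections.setdefault(current, [])
        elif current is not None:
            # Lines before the first recognised heading are ignored.
            sections[current].append(line)
    return sections
```

The output of `split_sections` is a dictionary keyed by target label, which can also serve as a first draft of labels to correct by hand when building a dataset.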

AbishekSathiyendran · September 12, 2024, 10:44am

Thank you for the recommendations, will definitely check them out.
