I’m currently working on a project that extracts information from a PDF and matches it against target labels set by the user. I need to parse all sorts of PDF formats, and I’m having trouble gathering data for this purpose. For example, once I have the PDFs, how do I label them?
Any and all help is appreciated. Thanks in advance.
@AbishekSathiyendran sorry, but up front I have to ask what you mean by ‘all sorts of PDF formats’?
Isn’t PDF, well, PDF?
Or are you speaking of the layout?
Well, for example, CVs come in many different formats, and you would have to parse all of them to get the applicant details. Something like that.
Well, assuming you have the digital, not print copy of the document, I don’t think you need OCR.
I am going to ‘show my age’ a little, but I think ImageMagick could parse it for you, though that is in the realm of PHP. Perhaps they’ve made a Python translation(?)
Maybe, but my main issue is what comes after parsing: how do I get the information into the exact structure I want? Taking the same CV problem, people’s work experience comes in different lengths and formats. How do I set things up to extract that experience data properly?
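One simple way to approach the "different sizes and formats" problem is to split the extracted text on section headings before trying to parse the details. The following is only a minimal sketch under stated assumptions: the heading keywords in `SECTION_HEADINGS` and the function name `split_into_sections` are hypothetical, and it assumes each heading sits on its own line in the extracted text, which will not hold for every CV.

```python
import re

# Hypothetical heading keywords (an assumption, not an exhaustive list);
# extend these with the variants you actually see in your CVs.
SECTION_HEADINGS = {
    "experience": ["work experience", "professional experience",
                   "employment history", "experience"],
    "education": ["education", "academic background"],
    "skills": ["skills", "technical skills"],
}


def split_into_sections(text):
    """Group the lines of a CV under the most recent recognised heading."""
    sections = {"other": []}
    current = "other"
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Normalise a candidate heading: lowercase, drop trailing colons/spaces.
        candidate = re.sub(r"[:\s]+$", "", line.lower())
        label = next(
            (name for name, variants in SECTION_HEADINGS.items()
             if candidate in variants),
            None,
        )
        if label is not None:
            # A recognised heading starts (or resumes) that section.
            current = label
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections
```

Calling `split_into_sections(cv_text)["experience"]` would then give you just the lines under the experience heading, whatever their length, which is a smaller and more uniform input for any later NLP or labelling step.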
@AbishekSathiyendran if you decide to do DLS, and then NLP, I think this will give you some ideas how you might approach this problem.
I was kind of thinking of something more basic first. I mean, you have to get the text out of the document, right? How might you achieve that?
Yes, today there are all these ‘multimodal’ solutions, but I say go simple first and start from there.
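As a rough illustration of the "go simple first" step, here is a minimal sketch of pulling the text layer out of a PDF. It assumes the third-party `pypdf` package, which is my choice for the example and not something named in this thread; the helper names `extract_pdf_text`, `clean_text`, and `has_text_layer` are hypothetical, and the 20-character threshold is arbitrary.

```python
import re


def extract_pdf_text(path):
    """Return the text layer of each page as a list of strings."""
    # Imported here so the helpers below still load without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() can come back empty for scanned, image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def clean_text(text):
    """Collapse runs of spaces and tabs, and drop blank lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def has_text_layer(pages, min_chars=20):
    """Rough check: is there enough extracted text to skip OCR?"""
    return any(len(clean_text(page)) >= min_chars for page in pages)
```

With `pypdf` installed (`pip install pypdf`), something like `pages = extract_pdf_text("cv.pdf")` followed by `has_text_layer(pages)` gives a quick indication of whether the file is a digital document or a scan that would need OCR after all.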
Use this library for extracting data from PDFs: PDF Parser (see “Welcome to PDF Parser’s documentation!”).
Try this tutorial: How to Extract Data from PDF Files with Python.
As mentioned by @Nevermnd, PDFs are a single format. Take the simpler solution and work with that.
Thank you for the recommendations, will definitely check them out.