I am trying to build a model using Hugging Face where, if I give an image to the model, it identifies key identity fields in general, such as name, date of birth, country, issue date, and expiry date, irrespective of document type. It should support any document (passport, driving license, etc.), and after identifying the fields I can apply OCR to extract the data.
Earlier I tried building separate models for each document, trained on YOLOv7, where I created bounding boxes around these fields, gave them class names, and mapped them in my project. But my project's complexity keeps increasing since I have a separate model for every document type.
Am I following the right approach? What could be done better?
Hi @priya07,
Your approach of creating separate models for each document type using YOLOv7 is effective for specialized tasks, but it will increase your project's complexity as you scale to more document types. Here is a streamlined approach to simplify your project:
- Unified Model: Train a single object detection model to identify key identity fields (name, DOB, country, etc.) across all document types.
- Document-Type Agnostic Approach: Use transformer-based models like LayoutLMv3 or Donut for generalized document understanding, reducing the need for separate models.
- Post-Processing with OCR: Detect fields with your model and apply OCR (e.g., Tesseract, PaddleOCR) to extract text.
- End-to-End Pipeline: Normalize images, detect fields, extract text with OCR, and refine results with rules/models (see the sketch after this list).
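For reference, here is a minimal sketch of such a pipeline: one unified detector feeding crops into OCR. It assumes a YOLO model you have fine-tuned on ID fields across all document types; the weights file `unified_id_fields.pt` and the class names in the comment are placeholders for your own training setup, and Tesseract stands in for whichever OCR engine you pick:

```python
# Minimal sketch: one unified field detector + OCR, instead of one model per document type.
from PIL import Image
from ultralytics import YOLO
import pytesseract

# Hypothetical weights fine-tuned on ID fields across all document types.
detector = YOLO("unified_id_fields.pt")

def extract_fields(image_path: str) -> dict:
    image = Image.open(image_path).convert("RGB")
    result = detector(image)[0]  # one image in -> one Results object out
    fields = {}
    for box in result.boxes:
        label = result.names[int(box.cls)]     # e.g. "name", "dob", "expiry_date"
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # absolute pixel coordinates
        crop = image.crop((x1, y1, x2, y2))
        # OCR the cropped field; add per-field validation (date formats, etc.) afterwards
        fields[label] = pytesseract.image_to_string(crop).strip()
    return fields

print(extract_fields("passport_sample.jpg"))
```

The main design point is that document type never appears in the code path: the detector's classes are the identity fields themselves, so a new document layout only needs more training data, not a new model.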
If you have any doubts, discuss your project in detail with our AI tech team; they will assist you with it.
Hello, yes, I tried a unified model as well, but the accuracy decreases. If I label 10 fields per identity document, only 5-6 of them are extracted, and if a new document that was not part of the training set comes up, it gives misplaced fields and extracts wrong information, e.g., it detects place of birth as name…
It would be better to connect with the tech team, then, to identify what exactly the solution for this issue could be. Let's either schedule a virtual meeting with them at a convenient time, or you can share your details via the link shared.
Try the Donut approach (clovaai/donut on GitHub).
But I think Phi-3-Vision should work as well. You will probably need some image slicing and smart prompting.
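For completeness, here is roughly what Donut inference looks like through `transformers`. The checkpoint below is fine-tuned on receipts (CORD), so it is only there to show the interface; for identity documents you would fine-tune on your own annotated set with your own task prompt (the `<s_id_fields>` prompt in the comment is hypothetical):

```python
# Rough Donut inference sketch. Swap the CORD checkpoint and task prompt
# for your own fine-tuned identity-document model.
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

image = Image.open("passport_sample.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt selects the parsing task; a model fine-tuned on ID documents
# would define its own prompt, e.g. "<s_id_fields>" (hypothetical).
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
    )

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task prompt token
print(processor.token2json(sequence))  # structured fields as JSON
```

Unlike the detector + OCR route, Donut is OCR-free: it generates the structured fields directly from pixels, which is exactly why it tends to generalize better to layouts it has not seen.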