From the source, Documind appears to:
1) Install tools like Ghostscript, GraphicsMagick, and LibreOffice with a JS script. 2) Convert document pages to Base64 PNGs and send them to OpenAI for data extraction. 3) Use Supabase for unclear reasons.
Some issues with this approach:
* OpenAI may retain and use your data for training, raising privacy concerns [1].
* Dependencies should be managed with Docker or package managers like Nix or Pixi, which are more robust. Example: a tool like Parsr [2] provides a Dockerized pdf-to-json solution, complete with OCR support and an HTTP api.
* GPT-4 vision seems like a costly, error-prone, and unreliable solution, not really suited for extracting data from sensitive docs like invoices, without review.
* Traditional methods (PDF parsers with OCR support) are cheaper, more reliable, and avoid retention risks for this particular use case. Although these tools do require some plumbing... probably LLMs can really help with that!
While there are plenty of tools for structured data extraction, I think there’s still room for a streamlined, all-in-one solution. This gap likely explains the abundance of closed-source commercial options tackling this very challenge.
---
1: https://platform.openai.com/docs/models#how-we-use-your-data
2: https://github.com/axa-group/Parsr