You need to provide more information.
Is your data organized, or is it just a dump of unrelated content?
- If you have a bag of files without any metadata, the best option is to build something like a RAG pipeline, with an OCR pre-processing step for image files (or even a call to a multimodal model).
- If the content is well organized with a logical structure, an agent could extract the information it needs by looking around a bit.
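For the unstructured case, the ingestion step mostly comes down to deciding which files need OCR before they can be indexed. A rough sketch of that routing, where the extension sets and the `plan_extraction` name are my own assumptions and the actual OCR or multimodal call is left out:

```python
from pathlib import Path

# Assumed extension lists; adjust to whatever is actually in your dump.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
TEXT_EXTS = {".txt", ".md", ".csv", ".json"}


def plan_extraction(paths):
    """Group files by the extraction step they need before indexing."""
    plan = {"ocr": [], "text": [], "other": []}
    for raw in paths:
        ext = Path(raw).suffix.lower()
        if ext in IMAGE_EXTS:
            plan["ocr"].append(str(raw))    # run OCR (or a multimodal model) first
        elif ext in TEXT_EXTS:
            plan["text"].append(str(raw))   # can be chunked and embedded directly
        else:
            plan["other"].append(str(raw))  # needs a dedicated parser or manual review
    return plan
```

Everything that lands in `other` (PDFs, office documents and so on) would need its own extractor, which is usually where most of the effort goes.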
Is it static, or does it change day by day?
- If it is static, you could index everything at once; if not, an agent that picks what to reindex would be a better call.
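Picking what to reindex does not have to involve a model at all. One simple way, sketched here under the assumption that the data lives in a local folder, is to keep a manifest of content hashes and compare it with a fresh scan. The `scan` and `diff_manifest` names are hypothetical helpers, not part of any library:

```python
import hashlib
from pathlib import Path


def scan(root):
    """Map each file's relative path to the SHA-256 of its content."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifest(old, new):
    """Return (files to reindex, files to drop from the index)."""
    # New or modified files: the hash is missing from the old manifest or differs.
    changed = sorted(k for k, digest in new.items() if old.get(k) != digest)
    # Files that disappeared since the last scan.
    removed = sorted(k for k in old if k not in new)
    return changed, removed
```

You would store the result of `scan` after each indexing run (for example as JSON) and only re-embed the `changed` list on the next run, which also helps keep the cost down.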
I'm not aware of an existing solution like this, but it seems doable as an MCP server. The cost will scale quickly, though.