Does this use CLIP or something similar to get embeddings for each image, plus normal text embeddings for the text fields, and then feed the top N results to a VLM (or LLM) to select the best answer(s)?
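To make that question concrete, here is a toy sketch of the retrieval step I'm guessing at. It is not how this project actually works. The vectors are made up by hand and stand in for CLIP or text embeddings, and `cosine` and `top_n` are names I invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_n(query_vec, items, n=3):
    """Return the ids of the n items most similar to the query.

    items is a list of (item_id, vector) pairs. In a real system the
    vectors would come from an embedding model; here they are fake.
    """
    scored = [(cosine(query_vec, vec), item_id) for item_id, vec in items]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:n]]

# Hand-written placeholder embeddings (assumption, not model output).
items = [
    ("cat.jpg", [0.9, 0.1, 0.0]),
    ("dog.jpg", [0.8, 0.3, 0.1]),
    ("invoice.txt", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.0, 0.0]

# These candidates are what would then be handed to the VLM/LLM to pick from.
candidates = top_n(query, items, n=2)
print(candidates)
```

The second stage, where a VLM looks at the shortlisted items and picks the best one(s), is left out because I don't know what model or API is used here.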
What's the advantage of this over using LlamaIndex?
Although, even asking that question, I will be honest: the last time I used LlamaIndex, it seemed like almost everything had to be shoehorned in, because using that library was a foregone conclusion. In the end ChromaDB was doing just about all the work, since the built-in test vector store that LlamaIndex has performs strangely badly at any scale.
I do like how simple the LlamaIndex DocumentStore (or whatever it's called) is, where you can just point it at a directory. But it seems that when you use a specific vector DB, you often can't do that.
I guess the other thing people do is put everything in Postgres. Do people use pgvector to store image embeddings?