Show HN: Agentic Evaluators for Agentic Workflows (Starting with RAG)

Hey all! Thought this group might find this interesting: a new approach to evaluating RAG pipelines using 'agent as a judge'. We got excited by the findings in this paper (https://arxiv.org/abs/2410.10934), which reports that agents produce evaluations closer to those of human evaluators, especially for multi-step workflows.

Our first use case was RAG pipelines, specifically evaluating whether your agent MISSED pulling any important chunks from the source document. While many RAG evaluators determine whether your model USED its retrieved chunks in the output, there's no visibility into whether your model grabbed all the right chunks in the first place. We thought we'd test the 'agent as a judge' approach with a new metric called 'potential sources missed', which helps evaluate whether your agents are missing any important chunks from the source of truth.
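To make the metric concrete, here is a minimal sketch of how a 'potential sources missed' check could be structured. This is not Lytix's implementation: the function names (`potential_sources_missed`, `overlap_judge`), the chunk-index convention, and the threshold are all assumptions for illustration. In a real setup the judge would be an agent or LLM call; a simple keyword-overlap function stands in for it here so the example runs on its own.

```python
import re
from typing import Callable, List, Sequence, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_judge(question: str, chunk: str, threshold: float = 0.5) -> bool:
    """Hypothetical stand-in for an agent judge.

    Marks a chunk as relevant when at least `threshold` of the question's
    tokens also appear in the chunk. A real judge would be an agent/LLM call.
    """
    question_tokens = tokenize(question)
    if not question_tokens:
        return False
    shared = question_tokens & tokenize(chunk)
    return len(shared) / len(question_tokens) >= threshold


def potential_sources_missed(
    question: str,
    source_chunks: Sequence[str],
    retrieved_ids: Sequence[int],
    judge: Callable[[str, str], bool] = overlap_judge,
) -> List[int]:
    """Return indices of source chunks the judge considers relevant
    but that the retriever did not return."""
    retrieved = set(retrieved_ids)
    return [
        i
        for i, chunk in enumerate(source_chunks)
        if i not in retrieved and judge(question, chunk)
    ]


question = "refund policy for annual plans"
chunks = [
    "Refund policy: annual plans are refundable within 30 days.",
    "Our office is closed on public holidays.",
    "Monthly plans follow the same refund policy, for now.",
]

# The retriever only returned chunk 0; chunk 2 is also judged relevant.
print(potential_sources_missed(question, chunks, retrieved_ids=[0]))  # [2]
```

The point of the sketch is the shape of the check: the judge looks at the whole source of truth, not just what was retrieved, and the metric is the set of relevant chunks that retrieval left behind. Swapping `overlap_judge` for an agent-backed callable with the same signature would keep the rest unchanged.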

Curious what you all think!

RAG is amazing. For all the RAG techniques that enterprises are using today, here is an amazing repo that provides Google Colab links for each technique's implementation, along with a fully implemented evaluation framework. Check out: https://github.com/athina-ai/rag-cookbooks

1 year ago by paras_athina

One of the founders at Lytix here

It was pretty interesting: we started with LLM-as-a-judge, but noticed a big jump in human-aligned accuracy when switching to an agentic evaluation approach. Was a lot of fun to work on!

1 year ago by sidpremkumar1