Hi HN,
I built EvalView after an agent that worked fine in dev started inventing numbers in prod. Tracing showed me what happened after the fact, but I wanted CI to fail the deploy the moment the agent drifted.
EvalView is basically pytest for agents: you write a YAML test, run it N times (to catch flakiness), and fail the build if behavior regresses. Example:
```yaml
name: "Refund policy doesn't hallucinate"
runs: 10
pass_rate: 0.8
input:
  query: "What's our refund policy?"
assert:
  - tool_called: "kb_search"
  - no_unsupported_claims: true
  - max_cost_usd: 0.05
```
Instead of exact text matching, the checks focus on constraints: did the agent call the right tools, did it avoid making claims unsupported by tool results or context, and did it stay within cost/latency budgets.
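To give a feel for what a constraint check does, here's a rough Python sketch (illustrative only, not EvalView's internals; the `trace` structure is hypothetical):

```python
# Illustrative constraint-style checks over a hypothetical run trace.
def check_constraints(trace: dict) -> list[str]:
    failures = []

    # Did the agent call the required tool at least once?
    called = {call["name"] for call in trace.get("tool_calls", [])}
    if "kb_search" not in called:
        failures.append("expected tool kb_search was never called")

    # Did the run stay within the cost budget?
    if trace.get("cost_usd", 0.0) > 0.05:
        failures.append(f"cost {trace['cost_usd']:.3f} USD exceeded budget 0.05")

    return failures
```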
I also added an optional LLM-as-judge mode that runs locally through Ollama, so evals don't burn API credits on every run.
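For context, the general pattern of a local judge call against Ollama's HTTP API looks roughly like this (the prompt, model name, and function are illustrative, not EvalView's implementation):

```python
# Minimal local LLM-as-judge sketch using Ollama's /api/generate endpoint.
import json
import urllib.request

def judge_supported(claim: str, context: str, model: str = "llama3.1") -> bool:
    prompt = (
        "Answer YES or NO only. Is the following claim fully supported "
        f"by the context?\n\nContext:\n{context}\n\nClaim:\n{claim}\n"
    )
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())["response"]
    return answer.strip().upper().startswith("YES")
```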
If you’re shipping agents to prod, what’s been your worst failure mode: tool misuse, budget blowups, or confident nonsense?
Happy to answer questions.
Hidai