Fantastic FAQ, thank you Hamel for writing it up. We had an open space on AI Evals at PyCon this year and had lots of discussion around similar questions. I only wrote down the questions, however:
## Evaluation Metrics & Methodology
* What metrics do you use (e.g., BERTScore, ROUGE, F1)? Are similarity metrics still useful? (See the sketch after this list.)
* Do you use step-by-step evaluations or evaluate full responses?
* How do you evaluate VLM (vision-language model) summarization? Do you sample outputs or extract named entities?
* How do you approach offline (ground truth) vs. online evaluation?
* How do you handle uncertainty or "don’t know" cases? (Temperature settings?)
* How do you evaluate multi-turn conversations?
* A/B comparisons and discrete labels (e.g., good/bad) are easier to interpret.
* It's important to counteract bias toward your own favorite eval questions by ensuring a diverse dataset.
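
For the similarity-metrics question above, here's a minimal sketch of two reference-based scores: a SQuAD-style token-overlap F1 computed by hand, plus ROUGE-L via the `rouge-score` package. The answer pairs are invented for illustration; whether scores like these are still useful for generated answers was exactly the open question.

```python
# Sketch: reference-based similarity metrics for generated answers.
# Token-level F1 is computed by hand; ROUGE-L uses the rouge-score package
# (pip install rouge-score). The example answer pairs are invented.
from collections import Counter

from rouge_score import rouge_scorer


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


pairs = [  # (generated answer, ground-truth answer) -- invented examples
    ("The benefit cap is $5,000 per year.", "Employees can claim up to $5,000 per year."),
    ("I don't know.", "Claims must be filed within 30 days."),
]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for prediction, reference in pairs:
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure
    print(f"F1={token_f1(prediction, reference):.2f}  ROUGE-L={rouge_l:.2f}  | {prediction}")
```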
## Prompting & Models
* Do you modify prompts based on the specific app being evaluated?
* Where do you store prompts: text files, Prompty, a database, or in code? (See the sketch after this list.)
* Do you have domain experts edit or review prompts?
* How do you choose which model to use?
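
On the prompt-storage question, here's a minimal sketch of the "prompts in text files" option. The `prompts/summarize.txt` path and its `{document}` placeholder are hypothetical; a Prompty file or a database-backed store would replace the loading step.

```python
# Sketch: storing prompts as plain text templates instead of hardcoding them.
# The prompts/summarize.txt file and its {document} placeholder are invented.
from pathlib import Path

PROMPT_DIR = Path("prompts")


def load_prompt(name: str, **variables: str) -> str:
    """Read a prompt template from disk and fill in its placeholders."""
    template = (PROMPT_DIR / f"{name}.txt").read_text(encoding="utf-8")
    return template.format(**variables)


system_prompt = load_prompt("summarize", document="...full document text here...")
```

Keeping prompts in versioned files like this also makes it easier for domain experts to review and edit them without touching application code.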
## Evaluation Infrastructure
* How do you choose an evaluation framework?
* What platforms do you use to gather domain expert feedback or labels?
* Do domain experts label outputs or also help with prompt design?
## User Feedback & Observability
* Do you collect thumbs up / thumbs down feedback? (See the sketch after this list.)
* How does observability help identify failure modes?
* Do models tend to favor their own outputs? (There's research on this.)
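
For the thumbs up / thumbs down question, here's a minimal sketch of recording that feedback alongside a trace ID so observability tooling can join it back to the original request. The field names and the JSONL sink are assumptions, not any particular product's schema.

```python
# Sketch: recording explicit user feedback next to the request it refers to.
# The schema and the feedback.jsonl sink are invented for illustration.
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Feedback:
    trace_id: str      # ID of the logged request/response pair
    thumbs_up: bool    # True for thumbs up, False for thumbs down
    comment: str = ""  # optional free-text from the user
    timestamp: str = ""


def record_feedback(feedback: Feedback, path: Path = Path("feedback.jsonl")) -> None:
    """Append one feedback event as a JSON line."""
    feedback.timestamp = datetime.now(timezone.utc).isoformat()
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(feedback)) + "\n")


record_feedback(Feedback(trace_id="req-123", thumbs_up=False, comment="Answer cited the wrong doc"))
```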
I personally work on adding evaluation to our most popular Azure RAG samples, and I put a Textual CLI interface in this repo that I've found helpful for reviewing the eval results:
https://github.com/Azure-Samples/ai-rag-chat-evaluator
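
For anyone curious what a Textual-based review flow can look like, here's a minimal sketch (not the code from that repo) that loads results from a hypothetical `eval_results.jsonl` and shows them in a table:

```python
# Sketch: a tiny Textual app for paging through eval results in the terminal.
# Not the ai-rag-chat-evaluator code; eval_results.jsonl and its fields are invented.
import json
from pathlib import Path

from textual.app import App, ComposeResult
from textual.widgets import DataTable, Footer, Header


class EvalReviewApp(App):
    """Show one row per evaluated question with its scores."""

    def __init__(self, results_path: Path = Path("eval_results.jsonl")) -> None:
        super().__init__()
        self.results_path = results_path

    def compose(self) -> ComposeResult:
        yield Header()
        yield DataTable()
        yield Footer()

    def on_mount(self) -> None:
        table = self.query_one(DataTable)
        table.add_columns("question", "groundedness", "relevance", "answer")
        for line in self.results_path.read_text(encoding="utf-8").splitlines():
            row = json.loads(line)
            table.add_row(
                row["question"],
                str(row["groundedness"]),
                str(row["relevance"]),
                row["answer"][:60],  # truncate long answers for the table view
            )


if __name__ == "__main__":
    EvalReviewApp().run()
```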