Fantastic FAQ, thank you Hamel for writing it up. We had an open space on AI Evals at PyCon this year and had lots of discussion around similar questions. I only wrote down the questions, however:
## Evaluation Metrics & Methodology
* What metrics do you use (e.g., BERTScore, ROUGE, F1)? Are similarity metrics still useful? (See the sketch after this list.)
* Do you use step-by-step evaluations or evaluate full responses?
* How do you evaluate VLM (vision-language model) summarization? Do you sample outputs or extract named entities?
* How do you approach offline (ground truth) vs. online evaluation?
* How do you handle uncertainty or "don’t know" cases? (Temperature settings?)
* How do you evaluate multi-turn conversations?
* A/B comparisons and discrete labels (e.g., good/bad) are easier to interpret.
* It's important to counteract bias toward your own favorite eval questions by ensuring a diverse dataset.
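
For the similarity-metrics question above, here's a minimal sketch of two reference-based scores: a SQuAD-style token-overlap F1 computed by hand, plus ROUGE-L via the `rouge-score` package. The answer pairs are invented for illustration; whether scores like these are still useful for generated answers was exactly the open question.

```python
# Sketch: reference-based similarity metrics for generated answers.
# Token-level F1 is computed by hand; ROUGE-L uses the rouge-score package
# (pip install rouge-score). The example answer pairs are invented.
from collections import Counter

from rouge_score import rouge_scorer


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


pairs = [  # (generated answer, ground-truth answer) -- invented examples
    ("The benefit cap is $5,000 per year.", "Employees can claim up to $5,000 per year."),
    ("I don't know.", "Claims must be filed within 30 days."),
]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for prediction, reference in pairs:
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure
    print(f"F1={token_f1(prediction, reference):.2f}  ROUGE-L={rouge_l:.2f}  | {prediction}")
```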
## Prompting & Models
* Do you modify prompts based on the specific app being evaluated?
* Where do you store prompts: text files, Prompty, a database, or in code? (See the sketch after this list.)
* Do you have domain experts edit or review prompts?
* How do you choose which model to use?
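
On the prompt-storage question, here's a minimal sketch of the "prompts in text files" option. The `prompts/summarize.txt` path and its `{document}` placeholder are hypothetical; a Prompty file or a database-backed store would replace the loading step.

```python
# Sketch: storing prompts as plain text templates instead of hardcoding them.
# The prompts/summarize.txt file and its {document} placeholder are invented.
from pathlib import Path

PROMPT_DIR = Path("prompts")


def load_prompt(name: str, **variables: str) -> str:
    """Read a prompt template from disk and fill in its placeholders."""
    template = (PROMPT_DIR / f"{name}.txt").read_text(encoding="utf-8")
    return template.format(**variables)


system_prompt = load_prompt("summarize", document="...full document text here...")
```

Keeping prompts in versioned files like this also makes it easier for domain experts to review and edit them without touching application code.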
## Evaluation Infrastructure
* How do you choose an evaluation framework?
* What platforms do you use to gather domain expert feedback or labels?
* Do domain experts label outputs or also help with prompt design?
## User Feedback & Observability
* Do you collect thumbs up / thumbs down feedback? (See the sketch after this list.)
* How does observability help identify failure modes?
* Do models tend to favor their own outputs? (There's research on this.)
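
For the thumbs up / thumbs down question, here's a minimal sketch of recording that feedback alongside a trace ID so observability tooling can join it back to the original request. The field names and the JSONL sink are assumptions, not any particular product's schema.

```python
# Sketch: recording explicit user feedback next to the request it refers to.
# The schema and the feedback.jsonl sink are invented for illustration.
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Feedback:
    trace_id: str      # ID of the logged request/response pair
    thumbs_up: bool    # True for thumbs up, False for thumbs down
    comment: str = ""  # optional free-text from the user
    timestamp: str = ""


def record_feedback(feedback: Feedback, path: Path = Path("feedback.jsonl")) -> None:
    """Append one feedback event as a JSON line."""
    feedback.timestamp = datetime.now(timezone.utc).isoformat()
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(feedback)) + "\n")


record_feedback(Feedback(trace_id="req-123", thumbs_up=False, comment="Answer cited the wrong doc"))
```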
I personally work on adding evaluation to our most popular Azure RAG samples, and I put a Textual CLI interface in this repo that I've found helpful for reviewing the eval results:
https://github.com/Azure-Samples/ai-rag-chat-evaluator
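
For anyone curious what a Textual-based review flow can look like, here's a minimal sketch (not the code from that repo) that loads results from a hypothetical `eval_results.jsonl` and shows them in a table:

```python
# Sketch: a tiny Textual app for paging through eval results in the terminal.
# Not the ai-rag-chat-evaluator code; eval_results.jsonl and its fields are invented.
import json
from pathlib import Path

from textual.app import App, ComposeResult
from textual.widgets import DataTable, Footer, Header


class EvalReviewApp(App):
    """Show one row per evaluated question with its scores."""

    def __init__(self, results_path: Path = Path("eval_results.jsonl")) -> None:
        super().__init__()
        self.results_path = results_path

    def compose(self) -> ComposeResult:
        yield Header()
        yield DataTable()
        yield Footer()

    def on_mount(self) -> None:
        table = self.query_one(DataTable)
        table.add_columns("question", "groundedness", "relevance", "answer")
        for line in self.results_path.read_text(encoding="utf-8").splitlines():
            row = json.loads(line)
            table.add_row(
                row["question"],
                str(row["groundedness"]),
                str(row["relevance"]),
                row["answer"][:60],  # truncate long answers for the table view
            )


if __name__ == "__main__":
    EvalReviewApp().run()
```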