1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.
2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860
The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post + leaderboard comparing which LLM is the best at our robotic tasks.