Our LLM-controlled office robot can't pass butter

229

117

Hi HN! Our startup, Andon Labs, evaluates AI in the real world to measure capabilities and to see what can go wrong. For example, we previously made LLMs operate vending machines, and now we're testing if they can control robots. There are two parts to this test:

1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.

2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860

The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post + leaderboard comparing which LLM is the best at our robotic tasks.

The internal dialog breakdowns from Claude Sonnet 3.5 when the robot battery was dying are wild (pages 11-13): https://arxiv.org/pdf/2510.21860

3 months ago by lukeinator42

95% for humans. Who failed to get the butter?

3 months ago by koeng

I wonder whether that LLM has actually lost its mind so to speak or was just attempting to emulate humans who lose their minds?

Or to put it another way, if the writings of humans who have lost their minds (and dialogue of characters who have lost their minds) were entirely missing from the LLM’s training set, would the LLM still output text like this?

3 months ago by ummonk

Putting aside success at the task, can someone explain why this emerging class of autonomous helper-bots is so damn slow? I remember Google unveiled their experiments in this recently and even the sped-up demo reels were excruciating to sit through. We generally think of computers as able to think much faster than us, even if they are making wrong decisions quickly, so what's the source of latency in these systems?

3 months ago by ghostly_s

I guess I'm very confused as to why just throwing an LLM at a problem like this is interesting. I can see how the LLM is great at decomposing user requests into commands. I had great success with this on a personal assistant project I helped prototype. The LLM did a great job of understanding user intent and even extracting parameters regarding the requested task.

But it seems pretty obvious to me that after decomposition and parameterization, coordination of a complex task would be much better handled by a classical AI algorithm like a planner. After all, even humans don't put into words every individual action that makes up a complex task. We do this more while first learning a task, but if we had to do it for everything, we'd go insane.

3 months ago by DubiousPusher

The most surprising thing is that 5% of humans apparently failed this task! Where are they finding these test subjects?!

3 months ago by Reason077

I have a cat that will never fail to find the butter. Will it bring you the butter? Ha ha, of course not.

3 months ago by Finnucane

Will no one claim the Rick and Morty reference? I've seen that show like, once, and somehow I know this?

3 months ago by zzzeek

I built a whimsical LLM-driven robot to provide running commentary for my yard: https://www.chrisfenton.com/meet-grasso-the-yard-robot/

3 months ago by fentonc

Guess it has no purpose then

3 months ago by WilsonSquared

> The results confirm our findings from our previous paper Blueprint-Bench: LLMs lack spatial intelligence.

But I suppose that if you can train an LLM to play chess, you can also train it to have spatial awareness.

3 months ago by amelius

Funny I was looking at the chart like "what model is Human?"

3 months ago by ge96

Using an LLM for robot actuator control seems like pounding a screw. Wrong tool for the job.

Someday, and given the billions being thrown at the problem, not too far out, someone will figure out what the right tool is.

3 months ago by Animats

The error messages were truly epic, got quite a chuckle.

But boy am I glad that this is just in the play stage.

If someone were in a self-driving car that had 19% battery left and it started making comments like those, they would definitely not be amused.

3 months ago by sam_goody

95% pass rate for humans

waiting for the huggingface Lora

3 months ago by yieldcrv

Someone actually paid for this?

3 months ago by bhewes

when all you have is a hammer... everything looks like a nail

3 months ago by pengaru

How can I get early access to this "Human" model on the benchmarks? /s

3 months ago by hidelooktropic

It feels misguided to me.

I think the real value of LLMs for robotics is in human language parsing.

Turning "pass the butter" into a list of tasks the rest of the system is trained to perform: locate an object, pick up an object, locate a target area, drop off the object.
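
A rough Python sketch of that split, purely for illustration. Everything here is an assumption and not anything from the Butter-Bench paper: the `Step` type, the skill names, and the `SKILLS` table are hypothetical, and `parse_request` is a hard-coded stand-in for where the LLM call would go. The point is only that the language model produces the task list and the rest of the system executes it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    skill: str   # name of a pre-trained robot skill (hypothetical)
    target: str  # object or area the skill acts on


def parse_request(request: str) -> list[Step]:
    """Stand-in for the LLM: maps one known phrase to a fixed task list."""
    if request.strip().lower() == "pass the butter":
        return [
            Step("locate_object", "butter"),
            Step("pick_up", "butter"),
            Step("locate_area", "requester"),
            Step("drop_off", "butter"),
        ]
    raise ValueError(f"no plan for request: {request!r}")


# Hypothetical skill table. On a real robot each entry would call a
# trained controller; here each one just returns a log message.
SKILLS = {
    "locate_object": lambda target: f"located {target}",
    "pick_up": lambda target: f"picked up {target}",
    "locate_area": lambda target: f"found area near {target}",
    "drop_off": lambda target: f"dropped off {target}",
}


def execute(plan: list[Step]) -> list[str]:
    """Run each step through its skill, in order, and collect the log."""
    return [SKILLS[step.skill](step.target) for step in plan]


plan = parse_request("pass the butter")
log = execute(plan)
for line in log:
    print(line)
```

In this arrangement the LLM never touches the actuators. It only has to emit skill names the executor already knows, so a bad parse fails loudly at lookup time instead of turning into a bad motor command.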

3 months ago by throwawayffffas

>Our LLM-controlled office robot can't pass butter

was the script of Last Tango in Paris part of the training data? maybe it's just scared...

3 months ago by fsckboy