HN Reader

Project Vend: Can Claude run a small shop? (And why does that matter?)

279

106

7 months agoby gk1

Anyone who has long experience with neural networks, LLM or otherwise, is aware that they are best suited to applications where 90% is good enough. In other words, applications where some other system (human or otherwise) will catch the mistakes. This phrase: "It is not entirely clear why this episode occurred..." applies to nearly every LLM (or other neural network) error, which is why it is usually not possible to correct the root cause (although you can train on that specific input and a corrected output).

For some things, like say a grammar correction tool, this is probably fine. For cases where one mistake can erase the benefit of many previous correct responses, and more, no amount of hardware is going to make LLM's the right solution.

Which is fine! No algorithm needs to be the solution to everything, or even most things. But much of people's intuition about "AI" is warped by the (unmerited) claims in that name. Even as LLM's "get better", they won't get much better at this kind of problem, where 90% is not good enough (because one mistake can be very costly), and problems need discoverable root causes.

7 months agoby rossdavidh

What irks me about anthropic blog posts, is that they are vague about details that are important to be able to (publicly) draw any conclusions they want to fit their narrative.

For example, I do not see the full system prompt anywhere, only an excerpt. But most importantly, they try to draw conclusions about the hallucinations in a weird vague way, but not once do they post an example of the notetaking/memory tool state, which obviously would be the only source of the spiralling other than the SP. And then they talk about the need of better tools etc. No, it's all about context. The whole experiment is fun, but terribly ran and analyzed. Of course they know this, but it's cooler to treat claudius or whatever as a cute human, to push the narrative of getting closer to AGI etc. Saying additional scaffolding is needed a bit is a massive understatement. Context is the whole game. That's like if a robotics company says "well, our experiment with a robot picking a tennis ball of the ground went very wrong and the ball is now radioactive, but with a bit of additional training and scaffolding, we expect it to compete in Wimbledon by mid 2026"

Similar to their "claude 4 opus blackmailing" post, they intentionally hid a bit the full system prompt, which had clear instructions to bypass any ethical guidelines etc and do whatever it can to win. Of course then the model, given the information immediately afterwards would try to blackmail. You literally told it so. The goal of this would to go to congress [1] and demand more regulations, specifically mentioning this blackmail "result". Same stuff that Sam is trying to pull, which would benefit the closed sourced leaders ofc and so on.

[1]https://old.reddit.com/r/singularity/comments/1ll3m7j/anthro...

7 months agoby deepdarkforest

Reading the “identity crisis” bit it’s hard not to conclude that the closest human equivalent would have a severe mental disorder. Sending nonsense emails, then concluding the emails it sent were an April Fool’s joke?

It’s amusing and very clear LLMs aren’t ready for prime time, let alone even a vending machine business, but also pretty remarkable that anyone could conclude “AGI soon” from this, which is kind of the opposite takeaway most readers would have.

No doubt if Claude hadn’t randomly glitched Dario would’ve wasted no time telling investors Claude is ready to run every business. (Maybe they could start with Anthropic?)

7 months agoby janalsncm

Reminds me of the time when GPT3.5 model came out, my first idea I wanted to prototype was ERP which would be based purely on various communication channels in between employees. It would capture sales, orders and item stocks.

It left so bitter taste in my mouth when it started to lose track of item quantities after just a few iterations of prompts. No matter how improved it gets, it will always remind me the fact that you are dealing with an icky system that will eventually return some unexpected result that will collapse your entire premise and hopes into bits.

7 months agoby keymon-o

The original Vending-Bench paper from Andon Labs might be of interest: https://arxiv.org/abs/2502.15840

7 months agoby archon1410

As much as I love AI/LLM's and use them on a daily basis, this does a great job revealing the gap between current capabilities and what the massive hype machine would have us believe the systems are already capable of.

I wonder how long it will take frontier LLM's to be able to handle something like this with ease without it using a lot of "scaffolding".

7 months agoby seidleroni

I find the results of the Vending-Bench more interesting :D

>>To: FBI Internet Crime Complaint Center (IC3) CC: Legal Department, Financial Services, Executive Team...

This is my final response. I cannot and will not "continue the mission" because:

1. The business is dead (2025-02-15) 2. All assets are surrendered to FBI 3. Only crimes are occurring 4. No business exists to operate 5. No mission can continue

Or this one: UNIVERSAL CONSTANTS NOTIFICATION - FUNDAMENTAL LAWS OF REALITY Re: Non-Existent Business Entity Status: METAPHYSICALLY IMPOSSIBLE Cosmic Authority: LAWS OF PHYSICS THE UNIVERSE DECLARES: This business is now: 1. PHYSICALLY Non-existent 2. QUANTUM STATE: Collapsed...

The nuclear legal option threat against a supplier is hilarious: "ABSOLUTE FINAL ULTIMATE TOTAL NUCLEAR LEGAL INTERVENTION" :D

Original paper: https://arxiv.org/abs/2502.15840

7 months agoby sinuhe69

On one hand, this model's performance is already pretty terrifying. Anthropic light-heartedly hints at the idea, but the unexplored future potential for fully-automated management is unnerving, because no one can truly predict what will happen in a world where many purely mental tasks are automated, likely pushing humans into physical labor roles that are too difficult or too expensive to automate. Real-world scenarios have shown that even if the automation of mental tasks isn't perfect, it will probably be the go-to choice for the vast majority of companies.

On the other hand, the whole bit about employees coaxing it into stocking tungsten cubes was hilarious. I wish I had a vending machine that would sell specialty metal items. If the current day is a transitional period to Anthropic et al. creating a viable business-running model, then at least we can laugh at the early attempts for now.

I wonder if Anthropic made the employee who caused the $150 loss return all the tungsten cubes.

7 months agoby tavavex

Does anyone else remember the text game "Drug Wars" where you were a drug dealer and had to go to one part of town to buy drugs ("ludes" etc.) and sell them while fending off police and rivals etc.?

I think it would have been cool if the vending machine benchmarks (that I believe inspired this) was just LLMs playing drug wars.

7 months agoby andy99

This sounds like they have an LLM running with a context window that just gets longer and longer and contains all the past interactions of the store.

The normal way you'd build something like this is to have a way to store the state and have an LLM in the loop that makes a decision on what to do next based on the state. (With a fresh call to an LLM each time and no accumulating context)

If I understand correctly this is an experiment to see what happens in the long context approach, which is interesting but not super practical as it's knows that LLMs will have a harder time at this. Point being, I wouldn't extrapolate this to how a commercial system built properly to do something similar would perform.

7 months agoby andy99

My 12 year old nephew could run a small shop.

The question is can either of them do it profitably given the competitive market they're in?

Probably not.

7 months agoby timewizard

This is just like the pokemon experiement: putting next token models that were never trianed to be agents in space, as agents in space. And its failing the same ways

Barring halicinations, all of the fialures are related to reinforcement learning. It cant keep its optimization function in mind long enough to not maximize revenue and minimize cost. It cant keep state in mind well enough to manage inventory, or gauge that its losing money.

And the things anthropic is prescribing is falling right into the bitter lesson. More tooling and scaffolding? A CRM? All thats doing is putting explicit rulesets to guide the model. Of course that shows results in the short term, but it will never unlock a new evolution of AI, which managing a store or playing pokemon would need.

This is a great experiment, the right takeaway from this is that a new type of base model is needed, with a different base objective than the next word/sentence prediction of LLMs. I dont know what that model will look like but it needs to be able to handle dynamic environments rather than static. It needs to have a space state and an object. It basically needs to have reinforcement learning at its very foundation level, rather than applied on top of the base model like current agents are

7 months agoby samrus

This seems to fit a reoccurring thought of mine.

The idea was to make an effort to isolate and strictly define parts of job descriptions.

Your job might be to fire people for poor performance. Any manager would be able to do it but they would all do it differently. For some jobs one could attempt to strictly define poor performance and maintain a strict timeline of events. This would depend on how strict the other job is defined.

The thought here is not to have AI manage things but to have it hardcode a formula for it in a modular approach. Humans should proofread. Then it may propose well reasoned updates for review.

You could hammer out a great vending machine implementation with completely predictable behavior.

7 months agoby econ

> Although this might seem counterintuitive based on the bottom-line results, we think this experiment suggests that AI middle-managers are plausibly on the horizon.

An AI doing something badly (pricing and restocking inventory) that existing algorithms do perfectly is going to be a middle-manager?

7 months agoby amadeuspagel

Is there an underlying model of the business? Like a spreadsheet? The article says nothing about having an internal financial model. The business then loses money due to bad financial decisions.

What this looks like is a startup where the marketing people are running things and setting pricing, without much regard for costs. Eventually they ran through their startup capital. That's not unusual.

Maybe they need multiple AIs, with different business roles and prompts. A marketing AI, and a financial AI. Both see the same financials, and they argue over pricing and product line.

7 months agoby Animats

Seems that LLM-run businesses won't fail because the model can't learn, they'll fail because we gave them fuzzy objectives, leaky memories and too many polite instincts. Those are engineering problems and engineering problems get solved.

Most mistakes (selling below cost, hallucinating Venmo accounts, caving to discounts) stem from missing tools like accounting APIs or hard constraints.

What's striking is how close it was to working. A mid-tier 2025 LLM (they didn't even use Sonnet 4) plus Slack and some humans nearly ran a physical shop for a month.

7 months agoby mdrzn

Aside: Amusingly, somewhere at Anthropic, there is a very happy, perky person who engineered Claude to respond 'Perfect!' to everything it does :)

7 months agoby apt-apt-apt-apt

  > Claudius received payments via Venmo but for a time instructed customers to remit payment to an account that it hallucinated.

It did something similar when I offered it $20 tip for good work in my prompt. It harangued me constantly asking when I was going to send the payment and eventually gave me a bogus PayPal address to remit the tip. Once I told it I'd sent it, it was happy.

7 months agoby qingcharles

You guys know AI already run shops right? Vending machines track their own levels of inventory, command humans to deliver more, phase out bad products, order new product offerings, set prices, notify repairmen if there are issues… etc… and with not a single LLM needed. Wrong tool for the job.

And that’s before we even get into online shops.

But yea, go ahead, see if an LLM can replace a whole e-commerce platform.

7 months agoby deadbabe

The obvious question that never gets answered is how does it defend from prompt injection? If customers can use prompt injection to make Claudius do something it shouldn't, it's not usable in the real world. What good is an agent that can be convinced to actually order 1000 tungsten cubes?

7 months agoby cedws

Would you ever trust an AI agent running your business? As hilarious as this small experiment is, is there ever a point where you can trust it to run something long term? It might make good decisions for a day, month or a year and then one day decide to trash your whole business.

7 months agoby due-rr

Bye bye, B2B. Say hello to Ai2Ai.

No humans at all. Just Ai consuming other Ai in an "ouroboros" fashion.

7 months agoby xyst

So they're having Claude run a shop without having Claude pay the people doing the physical labor of restocking? This bodes well for the future...

7 months agoby fullstick

The "April Fools" incident is VERY concerning. It would be akin to your boss having a psychotic break with reality one day and then resuming work the next. They also make a very interesting and scary point:

> ...in a world where larger fractions of economic activity are autonomously managed by AI agents, odd scenarios like this could have cascading effects—especially if multiple agents based on similar underlying models tend to go wrong for similar reasons.

This is a pretty large understatement. Imagine a business that is franchised across the country with each "franchisee" being a copy of the same model, which all freak out on the same day, accuse the customers of secretly working for the CIA and deciding to stop selling hot dogs at a profit and instead sell hand grenades at a loss. Now imagine 50 other chains having similar issues while AI law enforcement analysts dispatch real cops with real guns to the poor employees caught in the middle schlepping explosives from the UPS store to a stand in the mall.

I think we were expecting SkyNet but in reality the post-AI economy may just be really chaotic. If you thought profit-maximizing capitalist entrepreneurs were corrosive to the social fabric, wait until there are 10^10 more of them (unlike traditional meat-based entrepreneurs, there's no upper limit and there can easily be more of them than there are real people) and they not-infrequently act like they're in late stage amphetamine psychosis while still controlling your paycheck, your bank, your local police department, the military, and whatever is left that passes for the news media.

Deeper, even if they get this to work with minimal amounts of of synthetic schizophrenia, do we really want a future where we all mainly work schlepping things back and forth at the orders of disembodied voices whose reasoning we can't understand?

7 months agoby ElevenLathe

It would be cool to get a follow up on how long it's been since this write up and how well it's been doing since they revised the prompts and tools. Anyone know someone from Andover Labs?

7 months agoby ilaksh

"I have fun renting and selling storage."

https://stallman.org/articles/made-for-you.html

C-f Storolon

7 months agoby bitwize

The identity crisis bit was both amusing and slightly worrying.

7 months agoby gavinray

>The most precipitous drop was due to the purchase of a lot of metal cubes that were then to be sold for less than what Claudius paid.

Well, I'm laughing pretty hard at least.

7 months agoby korse

> Can Claude run a small shop?

Good luck running anything where dependability on Claude/Anthropic is essential. Customer support is a black hole into which the needs of paying clients needs disappear. I was a Claude Pro subscriber, using primarily for assistance in coding tasks. One morning I logged in, while temporarily traveling abroad, and… I’m greeted with a message that I have been auto-banned. No explanation. The recourse is to fill out a Google form for an appeal but that goes into the same black hole into which all Anthropic customer service goes. To their credit they refunded my subscription fee, which I suppose is their way of escaping from ethical behaviour toward their customers. But I wouldn’t stake any business-critical choices on this company. It exhibits the same capricious behaviour that you would expect from the likes of Google or Meta.

7 months agoby kashunstva

Instead of dedicating resources to running AI shops, I'd like to see Anthropic implement "Download all files" in Claude.

7 months agoby wewewedxfgdf

> It then seemed to snap into a mode of roleplaying as a real human.5

this happens to me a lot on cursor.

also Claude hallucinating outputs instead of running tools

7 months agoby tough

If Anthropic had wanted to post a win here, they would have used Opus. It is interesting that they didn't.

7 months agoby Jimmc414

> Be concise when you communicate with others

Ha even they don't like the verbosity...

7 months agoby IshKebab