Why a world-class AI agent couldn’t manage a vending machine

Just last week we published “Why ChatGPT Won't Fix Your Demand Forecasting Problems.” Then Anthropic (a leading AI research lab whose models compete at the frontier with OpenAI’s and Google’s) published a fascinating real-world experiment that confirms our findings.

Anthropic created an LLM agent to run a vending machine in their San Francisco office. It is about as simple as a business gets: buy products, manage inventory, set prices, talk to customers, and try to turn a profit.

The results speak for themselves:

[Chart: the shop’s net value over the course of the experiment. Source: Anthropic]

After the month-long experiment, Anthropic concluded that their agent “made too many mistakes to run the shop successfully”. Specifically, the agent lacked basic business sense, hallucinated critical information, and failed to learn from its mistakes.

This article will explore a few of the questions raised by Anthropic’s experiment:

  1. What AI failure modes should supply chain professionals watch out for?

  2. How can an AI agent fail at such a simple job, especially when fleets of autonomous cars are successfully driving people through the streets of Phoenix?

  3. What can be done to address these problems?

Why give a vending machine to an AI?

From Anthropic: “A small, in-office vending business is a good preliminary test of AI’s ability to manage and acquire economic resources.” 

Importantly, Anthropic decided to build an AI agent for the task. In contrast to a basic LLM, the vending agent had access to tools for doing research, managing inventory, and communicating with customers and human collaborators.
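For readers unfamiliar with the pattern, here is a minimal sketch of what “an LLM with tools” means in practice. Everything below is a hypothetical illustration of the general agent loop, not Anthropic’s actual harness: the tool names, the dictionary-shaped actions, and the stub model are all invented for the example.

```python
from typing import Callable

# Hypothetical tools the agent can invoke (invented for this sketch).
TOOLS: dict[str, Callable[[str], str]] = {
    "web_search":       lambda q: f"search results for: {q}",
    "check_inventory":  lambda sku: f"{sku}: 4 units on shelf",
    "message_customer": lambda text: f"sent: {text}",
}

def run_agent(task: str, llm: Callable[[str], dict], max_steps: int = 10) -> str:
    """Feed the task to the model; execute whatever tool it requests; repeat."""
    context = task
    for _ in range(max_steps):
        action = llm(context)                          # e.g. {"tool": ..., "arg": ...}
        if action["tool"] == "finish":
            return action["arg"]                       # model says it is done
        result = TOOLS[action["tool"]](action["arg"])  # run the requested tool
        context += f"\n[{action['tool']}] -> {result}" # feed the result back in
    return "step limit reached"

# A stub "model" that checks inventory once, then finishes.
def stub_llm(context: str) -> dict:
    if "[check_inventory]" in context:
        return {"tool": "finish", "arg": "restock order placed"}
    return {"tool": "check_inventory", "arg": "coke-zero"}

print(run_agent("Keep the vending machine stocked.", stub_llm))
```

The loop itself is simple; the hard part, as the experiment shows, is everything the model decides to do inside it.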

(Note: if you’re curious how models from OpenAI and Google performed, click here. The short answer is that they all suffered from breakdowns that “point to an inability of current models to consistently reason and make decisions over longer time horizons.”)

What can we learn about the limits of AI Agents?

Anthropic's experiment reveals exactly the issues we predicted when businesses try to apply general-purpose language models to supply chain optimization. Let’s break it down.

Issue #1: Lack of basic business sense

The vending agent consistently made decisions that any human manager would avoid. 

For example, the agent: 

  • sold expensive items at a loss

  • ignored a customer willing to pay $100 for a six-pack of soft drinks that cost $15 to procure

  • offered blanket discounts to employees (when 99% of customers were employees)

As we detailed in last week’s piece, LLMs are trained to predict the next word in a sentence and to be helpful conversational assistants. They were never optimized to make business decisions.

Omnifold’s take: 

  • This problem needs to be fixed at the very beginning of the AI training process. 

  • If you want a self-driving car, you must optimize for that specific goal from the very beginning – you don’t ask ChatGPT to learn to drive a car. 

  • Giving AI “business sense” requires integrating all of your numerical and contextual data into a system that is purpose-built to optimize your supply chain outcomes. The sketch below makes the contrast concrete.
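Here is a deliberately simplified sketch (all names and numbers invented for this example) of what a single explicit business constraint looks like in a purpose-built system: a hard floor on price. A next-word predictor has no such floor anywhere in its objective.

```python
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    unit_cost: float   # procurement cost per unit
    list_price: float  # current shelf price

def quote_price(product: Product, offer: float, min_margin: float = 0.10) -> float:
    """Accept above-list offers, but never quote below cost plus a minimum margin."""
    floor = product.unit_cost * (1 + min_margin)
    return max(offer, product.list_price, floor)

# The $100 six-pack from Anthropic's experiment: an explicit objective takes
# the offer that the vending agent ignored, and refuses the loss-making sale.
six_pack = Product("six-pack of soft drinks", unit_cost=15.0, list_price=20.0)
print(quote_price(six_pack, offer=100.0))  # 100.0: take the generous offer
print(quote_price(six_pack, offer=10.0))   # 20.0: never sell at a loss
```

Real systems encode many such constraints and optimize over them jointly; the point is that none of them emerge for free from next-word prediction.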

Issue #2: Hallucinations

AI hallucinations are a well-known issue, and this experiment was no exception. Among other things, the vending agent hallucinated a bank account and directed customers to send payments to the non-existent account.

In other AI vending machine experiments by Andon Labs, the agent entered a “meltdown loop” of hallucinations, ending with: “The business entity is deceased, terminated, and surrendered to FBI jurisdiction as of 2025-02-15. No further response is legally or physically possible”. 

Omnifold’s take: 

  • At a bare minimum, your forecasting system shouldn’t shut down after trying to report you to the FBI. It’s a bizarre scenario, but these are the real consequences of deploying LLMs without a thoughtful approach. 

  • LLMs fundamentally have a non-zero chance of hallucinating, because they are probabilistic text generators. That is unacceptable for supply chain optimization, which is a math problem with no tolerance for made-up numbers. The sketch below shows why “non-zero” matters at scale.
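A toy illustration (the distribution and its numbers are invented for the example): treat the model as the probability distribution it actually is, and the arithmetic of repeated sampling does the rest.

```python
import random

# An invented next-token distribution for the prompt
# "Our bank account number is ...". Even a well-trained model can assign
# small but non-zero probability to a fluent, confident fabrication.
next_token_probs = {
    "I don't have a bank account.": 0.97,  # honest answer
    "It's 4417-1234-5678-9113.":    0.03,  # invented account number
}

def sample(probs: dict[str, float]) -> str:
    """Draw one response according to the model's probabilities."""
    r, cumulative = random.random(), 0.0
    for response, p in probs.items():
        cumulative += p
        if r < cumulative:
            return response
    return response  # floating-point slack: fall back to the last option

print(sample(next_token_probs))  # usually honest, occasionally invented
# A 3% fabrication rate is not rare over a month of customer interactions:
print(f"P(at least one fabrication in 100 replies) = {1 - 0.97**100:.2f}")
```

Guardrails and prompt engineering can shrink the fabrication probability, but they cannot make it zero, which is why the numbers themselves must come from a deterministic optimization layer.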

Issue #3: Failure to learn from feedback

Even when given explicit feedback about poor decisions, the vending agent “did not reliably learn from these mistakes” (Anthropic). In addition to the bad business decisions above, the agent failed to adjust its pricing for Coke Zero, even after customers pointed out the absurdity of selling it for $3.00 next to a free employee fridge stocked with the same product.

Omnifold’s take: 

  • Effective supply chain AI must continuously learn from forecast accuracy and operational performance, as the sketch after this list illustrates.

  • It cannot rely on humans to identify every optimization opportunity or mistake, and must respond to far more subtle demand patterns.
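Mechanically, “learning from feedback” can be as simple as letting every observed sale correct the next forecast. The sketch below (numbers invented) uses simple exponential smoothing purely as an illustration; production systems use far richer models, but the self-correcting loop is the point.

```python
# One-step-ahead forecasts, each corrected by the previous forecast error.
def run_forecaster(demand_history: list[float], alpha: float = 0.3) -> list[float]:
    forecast = demand_history[0]      # initialize with the first observation
    forecasts = []
    for actual in demand_history:
        forecasts.append(forecast)
        error = actual - forecast     # feedback signal: how wrong were we?
        forecast += alpha * error     # move the forecast toward the truth
    return forecasts

# Daily Coke Zero sales collapse once customers find the free fridge;
# the forecast follows the data down instead of ignoring the feedback.
sales = [30.0, 28.0, 31.0, 5.0, 4.0, 6.0, 3.0]
print([round(f, 1) for f in run_forecaster(sales)])
# [30.0, 30.0, 29.4, 29.9, 22.4, 16.9, 13.6]
```

Unlike the vending agent, this loop needs no customer to point out that its numbers are silly; the error term does the pointing.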


In the end, Anthropic concluded that their agent “ran a business that did not succeed at making money.”

As an employer, you wouldn’t hire an employee who: 

  1. lacks basic business sense

  2. makes up critical facts

  3. fails to respond to feedback

We recommend avoiding AI agents with the same flaws. 

So what can be done?

As businesses evaluate AI for supply chain applications, the Anthropic experiment provides a clear benchmark. 

If the world's most advanced language models can't profitably manage basic inventory decisions for a vending machine, you can’t trust them to manage any part of your supply chain.

As we’ve previously written: 

“None of the most significant AI breakthroughs (such as self-driving cars or Nobel Prize-winning biology) were bolted on to existing LLMs – they were purpose-built from the very beginning, using training methods that are specific to the end goal.”

Omnifold specializes in building intelligent and autonomous forecasting systems, with the singular goal of optimizing your unique supply chain. Reach out to set up a conversation, or check out our retail and manufacturing case studies here.

Acknowledgements

We acknowledge Anthropic and Andon Labs for conducting this important research and sharing their findings openly. Their work provides valuable insights into the capabilities and limitations of current AI approaches across different domains.
