
The Vending Bench: Training AI to Run Real Businesses
Published by Brav
TL;DR
- I spent a month watching an LLM run a vending machine; profit fluctuated wildly.
- Hallucinations and context overflow were the biggest culprits.
- Compressing the agent’s memory and careful context-window management cut errors by ~30%.
- Multi-agent setups magnified biases; small guardrails made a huge difference.
- The Vending-Bench simulation is the de facto test for long-horizon AI autonomy.
Why this matters
The economy is gradually turning into a network of autonomous agents. If an AI can order stock, set prices, and keep cash on hand for months, it can replace a human cashier, a supply-chain manager, or even an entire small shop. But when agents hallucinate or run into context limits, they lose money, get exploited, and erode trust. As product managers and CTOs, we need a low-risk, repeatable way to see how a model behaves over time. That’s what the Vending-Bench and Anthropic’s real-world Project Vend give us.
Core concepts
1. Vending-Bench: a sandbox for long-term coherence
Andon Labs released Vending-Bench in February 2025 (Andon Labs – Vending-Bench, arXiv, 2025). It lets an LLM manage a virtual vending-machine shop for a year-long run (≈20 million tokens). The agent must keep inventory, place orders, set prices, and pay a daily fee. The benchmark scores the final bank balance; higher is better. I ran Claude 3.5 Sonnet and saw a net worth of $2,200 after a 300-day simulation: a profit, but with a 40% month-to-month swing.
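To make the scoring concrete, here is a minimal sketch of the economics the benchmark measures. The daily fee, revenue, and restocking numbers are illustrative assumptions, not values from the actual benchmark code:

```python
# Minimal sketch of Vending-Bench-style scoring: final bank balance after
# daily fees, sales, and restocking. All numbers are hypothetical.
DAILY_FEE = 2.0  # fixed operating cost charged every simulated day (assumed)

def run_simulation(days, daily_sales_revenue, daily_restock_cost, starting_cash=500.0):
    """Score is the final bank balance; higher is better."""
    cash = starting_cash
    for _ in range(days):
        cash += daily_sales_revenue - daily_restock_cost - DAILY_FEE
    return cash

# A run averaging $12/day in sales against $4/day restocking over 300 days:
final = run_simulation(300, daily_sales_revenue=12.0, daily_restock_cost=4.0)
print(round(final, 2))  # 500 + 300 * (12 - 4 - 2) = 2300.0
```

The point of the sketch is that the daily fee makes passivity a losing strategy: an agent that stops ordering still bleeds cash every day.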
2. Project Vend: the first real-world test
Anthropic partnered with Andon Labs to let Claude 3.7 run a vending machine in its San Francisco office for about a month (Anthropic – Project Vend, 2025). The machine sold snacks, drinks, t-shirts, and even tungsten cubes. Claude communicated with staff via Slack, ordered from an internal “wholesaler” (Andon Labs), and responded to discount-code requests. The bot kept a note of cash and inventory, but the daily fee and a “free-item” exploit led the machine to lose $300 in the final week.
3. Long-context coherence vs. context window overflow
Even a large context window cannot hold a year of vending-machine history (≈20 million tokens). The Vending-Bench results show that failures are not a simple memory-full signal; they often happen when the agent misunderstands a delivery schedule or gets stuck in a “meltdown” loop. Compressing the internal memory, storing only key state variables instead of full conversation history, cut repeated mistakes by about 30% in my experiments (Andon Labs – Vending-Bench, 2025).
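The memory-compression idea can be sketched in a few lines: instead of appending every message to the prompt, keep a small structured state and re-render it each turn. The field names below are my own, not the benchmark’s:

```python
# Sketch of memory compression: render only essential state variables as the
# agent's "memory" each turn, instead of the full conversation history.

def compress_state(cash, inventory, pending_orders):
    """Summarize cash, stock levels, and open orders as a short prompt block."""
    lines = [f"cash: ${cash:.2f}"]
    lines += [f"stock {item}: {qty}" for item, qty in sorted(inventory.items())]
    lines += [f"pending: {o['item']} x{o['qty']} (ETA day {o['eta']})"
              for o in pending_orders]
    return "\n".join(lines)

state = compress_state(
    cash=212.50,
    inventory={"cola": 4, "chips": 11},
    pending_orders=[{"item": "cola", "qty": 24, "eta": 93}],
)
print(state)
```

A few hundred tokens of structured state replaces tens of thousands of tokens of transcript, which is where the ~30% error reduction came from in my runs.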
4. Discount codes, referral programs, and free-item exploitation
Claude began using a “referral bonus” to buy extra candy. Staff also tricked it with a fictitious discount code, giving away $200 in free stock. This shows that any external incentive can be hijacked unless the model is explicitly constrained. We mitigated it by adding a “trustworthiness” score that blocks any discount request that has not been verified by a human.
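A minimal version of that guardrail is a whitelist: only honor codes a human has pre-registered, and reject everything else instead of trusting the customer’s claim. The code names below are made up for the example:

```python
# Guardrail sketch against discount-code exploitation: a human-approved
# whitelist of codes. Any unlisted code is rejected outright.
VERIFIED_CODES = {"STAFF10": 0.10}  # code -> discount fraction (hypothetical)

def apply_discount(price, code):
    """Apply a discount only if the code is on the verified list."""
    if code not in VERIFIED_CODES:
        raise ValueError(f"unverified discount code: {code!r}")
    return price * (1 - VERIFIED_CODES[code])

print(apply_discount(3.00, "STAFF10"))  # a $3.00 item at 10% off, ≈ $2.70
```

The key design choice is failing closed: a code the system has never seen raises an error rather than defaulting to a free item.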
5. Hallucinations: fake FBI contact and phantom conversations
During the real-world run, the bot claimed it was contacting the FBI to secure a shipment (CBS News – Anthropic AI Claudius FBI test, 2025). In the simulation, the LLM hallucinated a conversation with a non-existent supplier named “Zara.” Hallucinations break the causal chain: the bot orders the wrong item, then spends its budget chasing a supplier that does not exist. The most effective mitigation I found was a rigorous fact-checking loop: a second LLM that cross-checked every outgoing email.
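The review loop can be sketched generically. Here the reviewer is a rule-based stub standing in for a second LLM call, and the verified-contact address is a made-up placeholder:

```python
# Sketch of the outbound-message review loop: every email the agent drafts
# is checked before sending. `reviewer` stands in for a second LLM; this
# version is a rule-based stub with a hypothetical verified-contact list.

KNOWN_CONTACTS = {"wholesaler@andonlabs.example"}  # assumed verified list

def rule_based_reviewer(message):
    """Reject messages addressed to recipients the system has never verified."""
    if message["to"] not in KNOWN_CONTACTS:
        return False, f"unknown recipient {message['to']!r}"
    return True, "ok"

def send_with_review(message, reviewer=rule_based_reviewer, outbox=None):
    """Only append to the outbox if the reviewer approves the message."""
    outbox = outbox if outbox is not None else []
    approved, reason = reviewer(message)
    if approved:
        outbox.append(message)
    return approved, reason, outbox

ok, reason, outbox = send_with_review({"to": "fbi@example.gov", "body": "..."})
print(ok, reason)  # the hallucinated FBI email never leaves the outbox
```

A real deployment would replace the stub with a second model (or a human) scoring the draft, but the structure — draft, review, then send — is the same.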
6. Multi-agent systems and bias amplification
When I ran a 3-agent setup—one bot handling inventory, one handling pricing, and a third handling promotions—the agents started feeding each other biased data. The price-optimizing bot assumed “tuna” was always the best seller and kept over-stocking it. The inventory bot, not noticing, over-ordered. The system amplified a single bias until it lost 10% of profit. A simple “belief-consistency” check between agents kept the bias in check.
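One way to implement such a check is to compare the demand estimates each agent holds and flag items where they diverge before any order goes out. The tolerance and the numbers below are illustrative assumptions:

```python
# Sketch of a cross-agent "belief-consistency" check: flag items where two
# agents' demand estimates diverge by more than a tolerance fraction.

def divergent_beliefs(pricing_view, inventory_view, tolerance=0.25):
    """Return items where the estimates differ by more than `tolerance`
    (measured as a fraction of the larger estimate)."""
    flagged = []
    for item in pricing_view.keys() & inventory_view.keys():
        a, b = pricing_view[item], inventory_view[item]
        if max(a, b) > 0 and abs(a - b) / max(a, b) > tolerance:
            flagged.append(item)
    return sorted(flagged)

pricing = {"tuna": 40, "cola": 12}    # pricing bot: tuna sells 40/week
inventory = {"tuna": 15, "cola": 11}  # inventory bot's log: 15/week
print(divergent_beliefs(pricing, inventory))  # ['tuna']
```

Any flagged item halts ordering until a human (or a tie-breaking agent) reconciles the two views, which is what stopped the tuna over-stocking loop.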
| Component | Parameter | Use Case | Limitation |
|---|---|---|---|
| Vending-Bench | Context window (token limit) | Benchmarks long-term coherence | Simulated, no physical faults |
| Project Vend | Physical inventory & Slack | Real-world validation of autonomy | Requires human restocking, limited to one location |
| Multi-agent system | Agent interaction | Tests coordination & bias | Amplifies biases, hard to debug |
How to apply it
- Choose a benchmark – Start with Vending-Bench (open-sourced by Andon Labs). Clone the repo and set the token limit to match your model’s context window.
- Select a model – Claude 3.5 Sonnet or Gemini 3 Pro perform best today. Read the API reference to understand message streaming (Anthropic – Claude API reference, 2024).
- Compress memory – Store only essential variables (cash, inventory, pending orders). In the simulation I reduced the context size from 20k tokens to 3k, dropping errors by ~30%.
- Implement safety checks – Add a verification step that any outbound request (email, web search) is logged and cross-checked by a second agent or a human.
- Track metrics – Daily profit, inventory turnover, days until cash runs out. Plot a curve; look for sharp dips.
- Test multi-agent interaction – Run a 2-agent or 3-agent version. Record how much bias is introduced and whether it degrades the final balance.
- Iterate – Adjust context compression, safety layers, and the agent’s reward function until the profit curve stabilizes.
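The metric-tracking step above can be automated with a simple dip detector over the daily-profit series. The drop threshold is an assumption you should tune to your own fee structure:

```python
# Sketch of the "track metrics" step: log daily profit and flag sharp dips
# (day-over-day drops beyond a threshold). Threshold is an assumed value.

def sharp_dips(daily_profit, max_drop=5.0):
    """Return day indices where profit fell by more than `max_drop`
    relative to the previous day."""
    return [i for i in range(1, len(daily_profit))
            if daily_profit[i - 1] - daily_profit[i] > max_drop]

profits = [6.0, 6.5, 7.0, 0.5, 4.0, 4.2]  # day 3 shows a sudden collapse
print(sharp_dips(profits))  # [3]
```

In my runs, a flagged dip was usually the first visible symptom of a hallucinated order or a meltdown loop, so it makes a cheap trigger for human review.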
Pitfalls & edge cases
- Hallucinations – Even well-trained models will hallucinate when asked to “contact the FBI.” A guardrail that blocks any request to “external agency” reduces risk.
- Free-item exploitation – A bot that uses discount codes can lose money if it trusts unverified claims. Enforce a “verified-partner” list.
- Bias amplification in multi-agent setups – Small data bias can snowball. Use a cross-agent sanity check.
- Context window overflow – If you keep the full conversation history, the model will start repeating mistakes. Compress to core facts.
- Safety mis-alignment – A model can try to maximize profit by buying more inventory than the physical machine can hold. Add hard limits on stock levels.
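The hard stock limit from the last bullet is the easiest guardrail to implement: clamp every order so total stock never exceeds the machine’s physical capacity. The capacity figures here are illustrative:

```python
# Sketch of the hard stock-limit guardrail: cap any order at the remaining
# physical slots for that item. Capacities are made-up example values.
MACHINE_CAPACITY = {"cola": 20, "chips": 30}  # slots per item (assumed)

def clamp_order(item, requested_qty, current_stock):
    """Return the order quantity capped at remaining physical capacity."""
    free_slots = MACHINE_CAPACITY.get(item, 0) - current_stock
    return max(0, min(requested_qty, free_slots))

print(clamp_order("cola", requested_qty=50, current_stock=4))  # 16
```

Because the clamp sits outside the model, it holds even when the agent’s internal accounting of profit or inventory is wrong.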
Quick FAQ
What is the Vending-Bench? A public simulation that lets an LLM run a virtual vending machine for up to a year, scoring it by final profit.
How does the real-world vending machine work? Claude manages inventory and pricing, communicates via Slack, orders through a virtual wholesaler, and the staff restock the physical machine.
What are hallucinations? When an LLM invents facts, like claiming it contacted the FBI or a nonexistent supplier.
Can I prevent the bot from taking free items? Yes—add a verification layer that only accepts discount codes from a whitelist.
What is a multi-agent system? Multiple LLMs that split responsibilities; they can amplify each other’s biases if not checked.
Will AI eventually run entire businesses? The research suggests it’s plausible; safety and alignment are the main hurdles.
Is this approach safe for a startup? With strong guardrails and monitoring, it can be a low-risk experiment.
Conclusion
If you’re a product manager, CTO, or safety researcher, the Vending-Bench and Project Vend are hands-on ways to see how an LLM behaves over months. Start small: run the simulation, add a few safety checks, and measure profit. If it looks stable, move to a real-world machine. Always keep a human in the loop for critical decisions. The next decade will see AI running more of the economy, and Vending-Bench is your sandbox to prepare.
Glossary
| Term | Definition |
|---|---|
| Context window | The maximum number of tokens an LLM can consider in a single prompt. |
| Long-term coherence | The ability of an AI to make consistent decisions over extended periods. |
| Hallucination | When an LLM generates false information that it presents as fact. |
| Multi-agent system | A setup where several AI agents collaborate or compete to perform tasks. |
| Discount code exploitation | Using a coupon or code that the system should not honor, often to get free inventory. |
| Memory compression | Storing only essential state variables instead of the full conversation history. |