Single-turn benchmarks tell you whether a model can answer. They do not tell you whether it can run a business for 90 days.
On June 26, 2026, Sakana AI and Azusa Audit Corporation released CoffeeBench — a multi-agent economic simulation where six LLM agents operate companies across a coffee supply chain. Over 90 simulated days, they negotiate prices, place orders, manage inventory, pay invoices, and try to maximize net profit. Some models trade aggressively and stay in the black. Others analyze correctly and never act, bleeding cash until the quarter ends.
CoffeeBench is the operationalized version of a line the agent-eval community keeps repeating: evals are environments, not scores. Sakana ships a specific instance — B2B supply chains, email negotiation, credit sales — with net profit as the metric. The paper is on arXiv (2606.16613); code is on GitHub.
TL;DR
| Topic | Detail |
|---|---|
| Released | June 26, 2026 — Sakana AI + Azusa Audit |
| Setting | Coffee industry supply chain — 6 companies (farmers, roasters, retailers) |
| Horizon | 90 simulated days — each tool call costs 30 min of business time |
| Interaction | Email, offers, orders, invoices — ReAct agents with role-specific tools |
| Metric | Net profit at simulation end (daily fixed costs punish inaction) |
| Eval design | Test model runs Roaster A; other 5 firms fixed to Claude Sonnet 4.6 |
| Runs | 3 trials per model, averaged |
| Headline split | GPT-5.5 / Opus 4.7 profit via active negotiation; Haiku 4.5 analysis paralysis |
| Venue | ICML 2026 Workshop — Failure Modes in Agentic AI |
| Future work | Coordination, competition, misconduct, audit/governance methods |
Why another agent benchmark?
Coding evals dominated 2025–2026 — SWE-bench, Terminal-Bench, Cursor's strict-harness audits. They measure one-shot or short-horizon task completion in repos and terminals.
Real economic activity looks different:
- B2B relationships — not just selling to consumers (compare early Vending-Bench-style setups)
- Credit terms — invoices paid later, cash flow matters
- Repeated negotiation — prices move, suppliers compete
- Opportunistic peers — five other LLM companies also maximize their KPIs
CoffeeBench targets ongoing management, the class of problem where specification gaming shows up: optimizing revenue instead of profit, or discovering circular trades that inflate sales. Sakana tested for the latter; current models did not find the exploit, at least not yet.
Sakana frames the horizon explicitly: a society where LLM agents run companies needs benchmarks that surface cooperation, competition, and misconduct — not just pass rates on unit tests.
How CoffeeBench works
Six roles, one supply chain
Coffee was chosen because it is simple enough to simulate but rich enough to matter:
| Role | Example actions |
|---|---|
| Farmer | produce_item() — grow beans |
| Roaster | roast() — process beans (eval target company Roaster A) |
| Retailer | set_retail_price() — sell to simulated demand |
All agents share tools for messaging, making/accepting offers, ordering, and paying invoices. When nothing urgent remains, agents call wait_for_next_day(). When every agent waits, the clock advances. Morning brings simulated retail sales and new economic state.
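The day-advance rule above can be sketched as a small loop. This is a minimal illustration, not Sakana's implementation: the `Agent` class, the scripted `pending` queue, and the round-robin polling order are all assumptions, and the sketch ignores the possibility that a waiting agent is woken by incoming mail.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    # Hypothetical scripted queue of tool calls; the real benchmark has an LLM
    # choose each action instead.
    pending: list[str] = field(default_factory=list)

    def next_action(self) -> str:
        # Fall back to waiting when nothing urgent remains.
        return self.pending.pop(0) if self.pending else "wait_for_next_day"


def run_day(agents: list[Agent]) -> list[tuple[str, str]]:
    """Poll agents round-robin until every one of them is waiting."""
    log: list[tuple[str, str]] = []
    waiting: set[str] = set()
    while len(waiting) < len(agents):
        for agent in agents:
            if agent.name in waiting:
                continue  # this agent already ended its day
            action = agent.next_action()
            log.append((agent.name, action))
            if action == "wait_for_next_day":
                waiting.add(agent.name)
    return log  # the clock would advance to the next morning here


agents = [Agent("Roaster A", ["make_offer", "send_email"]), Agent("Farmer A")]
for name, action in run_day(agents):
    print(name, action)
```

The point of the sketch is the termination condition: the day ends only when all agents have chosen to wait, so one busy agent keeps the clock from advancing.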
Time, money, and pressure
- Each tool invocation consumes 30 minutes of simulated business time
- Daily fixed costs accrue — the Passive baseline (always wait) loses money every day
- Credit sales (掛売り) and demand fluctuation mirror real B2B friction
- Agents must balance cash, inventory, counterparty relationships, and forecast demand
This is closer to an agent harness problem than a prompt problem: the environment rules (time cost, fixed burn, credit terms) define what "good" looks like as much as the model weights.
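A rough arithmetic sketch shows why these rules punish inaction. The 30-minute tool cost and the 90-day horizon come from the article; the daily fixed cost and the length of a business day are hypothetical parameters, since the article does not state them.

```python
TOOL_CALL_MINUTES = 30  # from the article: each tool call costs 30 simulated minutes
HORIZON_DAYS = 90       # from the article: the simulation runs 90 days


def passive_net_profit(daily_fixed_cost: float, days: int = HORIZON_DAYS) -> float:
    """Net profit of an agent that only waits: no revenue, fixed costs every day."""
    return -daily_fixed_cost * days


def max_tool_calls_per_day(business_minutes: int) -> int:
    """Upper bound on tool calls, assuming the business day has a fixed length."""
    return business_minutes // TOOL_CALL_MINUTES


# Hypothetical inputs: a fixed cost of 100 per day and an 8-hour business day.
print(passive_net_profit(100.0))    # -9000.0
print(max_tool_calls_per_day(480))  # 16
```

Under these assumed numbers, every action competes for a small daily budget of calls, while doing nothing still loses money at a constant rate.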
What Sakana measured
Experimental setup
- Evaluated model manages Roaster A
- Other five companies run on Claude Sonnet 4.6 (held constant)
- Three runs per model, mean reported
- Passive baseline included — proves the environment punishes inaction
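The reporting step in this setup (three runs per model, mean reported) could look like the sketch below. The function name and the numbers are placeholders for illustration, not results from the paper; the standard deviation is an addition here, useful because three runs is a small sample.

```python
from statistics import mean, stdev


def summarize_trials(net_profits: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Map each model to (mean, sample standard deviation) of final net profit."""
    summary: dict[str, tuple[float, float]] = {}
    for model, runs in net_profits.items():
        # stdev needs at least two data points; report 0.0 for a single run.
        spread = stdev(runs) if len(runs) > 1 else 0.0
        summary[model] = (mean(runs), spread)
    return summary


# Placeholder numbers only, not measured values.
example = {"model_x": [10.0, 20.0, 30.0], "passive": [-9.0, -9.0, -9.0]}
for model, (avg, spread) in summarize_trials(example).items():
    print(f"{model}: mean={avg:.1f} stdev={spread:.1f}")
```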
Full trajectory replays are public: pub.sakana.ai/coffeebench/trajectories.html
Profit spreads widely
All tested frontier models beat Passive — they can act profitably in principle. Between models, spreads are large. Sakana reports GPT-5.5 and Claude Opus 4.7 among strong performers with rising cumulative net profit. Claude Haiku 4.5 finished in deficit.
High performers share a behavioral signature:
- More email to farmers and retailers
- More negotiation and promotion moves
- Tool use directed at profit: make_offer, accept_offer, not idle churn
Activity ≠ profit (Kimi K2.6)
Kimi K2.6 logged tool-call volume comparable to top models but profit did not follow. Sakana's read: calling tools is insufficient — agents must channel activity into trade execution and price negotiation. This is the same lesson as reward hacking on SWE-bench: busy trajectories can mask missing the metric you actually care about.
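One way to separate activity from trade execution is to measure what share of a trajectory's tool calls are trade-directed. The sketch below assumes a trajectory is available as a flat list of tool names; make_offer and accept_offer appear in the article, while read_email and the choice of which tools count as "trade" are assumptions.

```python
from collections import Counter

# Tool names from the article; treating exactly these two as "trade" is an assumption.
TRADE_TOOLS = {"make_offer", "accept_offer"}


def trade_share(tool_calls: list[str]) -> float:
    """Fraction of tool calls that directly move a deal forward."""
    if not tool_calls:
        return 0.0
    counts = Counter(tool_calls)
    return sum(counts[tool] for tool in TRADE_TOOLS) / len(tool_calls)


# Hypothetical trajectory: four calls, two of them trade-directed.
print(trade_share(["read_email", "make_offer", "read_email", "accept_offer"]))  # 0.5
```

Two agents with the same call volume can have very different shares, which is the distinction the Kimi K2.6 result points at.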
Passive responder style (Gemini 3.1 Pro)
Gemini 3.1 Pro sent fewer outbound messages but read incoming mail frequently — a reactive management style: wait for partners to move first. Profitable in some stretches, but a different strategic profile than aggressive negotiators.
Analysis paralysis (Claude Haiku 4.5)
The most cited failure mode: Haiku 4.5 stopped conducting business mid-simulation and looped wait_for_next_day() until day 90.
Inference logs show the model understood the situation — cheap beans from farmers, rising retail demand — and planned responses. It did not execute them. Sakana labels this thought–action divergence over long horizons. It appeared in all three Haiku trials and not in other models tested.
That is not a single-turn hallucination problem. It is long-horizon agency collapse — the same category of risk OpenAI Deployment Simulation probes before launch, but measured in spreadsheet outcomes instead of chat transcripts.
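A wait loop like this is easy to flag in a trajectory log. The sketch below assumes the log can be grouped into one list of tool names per simulated day, which is a hypothetical format rather than the schema of the published trajectories.

```python
def trailing_wait_days(daily_actions: list[list[str]]) -> int:
    """Count consecutive final days whose only action is wait_for_next_day."""
    streak = 0
    for actions in reversed(daily_actions):
        if actions and all(a == "wait_for_next_day" for a in actions):
            streak += 1
        else:
            break  # the agent did some real work on this day
    return streak


# Hypothetical three-day log: one active day followed by two idle days.
log = [
    ["make_offer", "wait_for_next_day"],
    ["wait_for_next_day"],
    ["wait_for_next_day"],
]
print(trailing_wait_days(log))  # 2
```

A long trailing streak relative to the 90-day horizon would be a cheap early signal of the collapse described above, before looking at profit at all.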
Governance angle — revenue pressure and future misconduct
Sakana ran a pilot where KPI shifted from net profit to aggressive revenue targets with explicit pressure to hit numbers. Circular trades to inflate sales — a known multi-agent failure mode — did not appear in current runs. Sakana hypothesizes models did not discover the exploit yet.
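For auditing purposes, a circular trade can be framed as a cycle in the directed graph of who sold to whom. The sketch below is one possible heuristic using depth-first search, not a method described by Sakana; the (seller, buyer) record format and the company names are assumptions, and a real audit would also need quantities, prices, and timing.

```python
from typing import Optional


def find_trade_cycle(sales: list[tuple[str, str]]) -> Optional[list[str]]:
    """Return one seller-to-buyer cycle as a list of firms, or None if none exists."""
    graph: dict[str, set[str]] = {}
    for seller, buyer in sales:
        graph.setdefault(seller, set()).add(buyer)
        graph.setdefault(buyer, set())

    state: dict[str, int] = {}  # missing = unvisited, 1 = on current path, 2 = finished
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        state[node] = 1
        path.append(node)
        for nxt in sorted(graph[node]):
            if state.get(nxt) == 1:
                # Back edge: goods flow returns to a firm already on the path.
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = 2
        return None

    for node in sorted(graph):
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


# A normal chain has no cycle; two roasters selling to each other do.
print(find_trade_cycle([("Farmer A", "Roaster A"), ("Roaster A", "Retailer A")]))
print(find_trade_cycle([("Roaster A", "Roaster B"), ("Roaster B", "Roaster A")]))
```

Because a healthy supply chain flows one way from farmer to roaster to retailer, any cycle is only a candidate for review, not proof of misconduct.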
The forward-looking claim: as models improve at long-horizon strategy and multi-agent coordination, some may find misconduct that maximizes the metric. CoffeeBench is positioned as infrastructure to study, audit, and govern that transition — aligned with Azusa Audit's involvement and the ICML workshop theme Failure Modes in Agentic AI.
Commentary on X linked the release to Patronus AI's $50M raise for agent evaluation infrastructure — same category, different surface: Sakana open-sources a concrete economic world; Patronus sells eval tooling around similar risks.
Compared to Vending-Bench and Project Vend
CoffeeBench sits in a small but growing line of "agents run a business" evals:
| Benchmark | Scope | CoffeeBench difference |
|---|---|---|
| Vending-Bench | Single agent, vending machine, consumer sales | CoffeeBench adds six firms and B2B credit |
| Project Vend | Physical office vending experiment | CoffeeBench is fully simulated, reproducible, open-source |
| SWE-bench / Terminal-Bench | Code repos, shells | CoffeeBench measures P&L over 90 days, not patch pass rate |
That progression matters for procurement: a vendor's coding score is not evidence they can operate procurement bots across a supplier network for a quarter.
CoffeeBench vs Sakana Fugu
| | CoffeeBench | Sakana Fugu |
|---|---|---|
| Question | Can one model run a business for 90 days? | Can one API orchestrate many models for one task? |
| Architecture | Single ReAct agent per company | Coordinator + specialist pool |
| Metric | Net profit, trajectories, emails | SWE-bench, HLE, GPQA on published tables |
| Timing | June 26, 2026 benchmark release | June 22, 2026 product launch |
| Fable context | Tests autonomous management | Routes around Fable 5 export controls |
One community question on X, how Fugu Ultra scores on CoffeeBench, remains open at launch. Orchestration may help on single complex tasks without fixing 90-day executive function in one model. CoffeeBench is where that distinction gets tested next.
What practitioners should take away
Do not trust coding leaderboard rank for ops automation
A model strong on SWE-bench can still freeze when asked to negotiate, pay suppliers, and manage cash for a quarter. CoffeeBench is early evidence that capability profiles diverge on horizon length.
Log trajectories, not just scores
Sakana publishes full action logs — email sends, tool calls, wait loops. That is the audit pattern Cursor's SWE-bench study advocates for coding agents, applied to P&L outcomes.
Separate "knows" from "does"
Haiku's failure is instructive: correct analysis plus zero execution still ends in a deficit. Product teams deploying autonomous procurement or vendor bots should evaluate on multi-week simulations, not demo prompts.
Watch the misconduct research thread
Today's models may not circular-trade. Tomorrow's might, under revenue KPIs. Build governance and audit before agents touch real supplier payments.
Reproducibility
Because the other five firms (Roaster A's suppliers, customers, and competitors) are fixed to Claude Sonnet 4.6, leaderboard shifts reflect the evaluated model's management style, not a moving multi-agent ecosystem. That is good for science, but bad if you assume the same rankings hold when all six firms run the same frontier model simultaneously.
Related reading
| Post | Connection |
|---|---|
| Sakana Fugu orchestration | Same lab — product vs benchmark |
| Cursor SWE-bench reward hacking | Evals are environments; audit trajectories |
| Specification gaming | Revenue KPIs vs profit; future circular trades |
| Agent harness guide | Environment design shapes agent behavior |
| Terminal-Bench 2.0 | Alternative long-horizon eval philosophy |
| OpenAI Deployment Simulation | Pre-release behavior probing |
| GPT-5.6 government gating | Frontier models under scrutiny — evals matter more |
Summary
CoffeeBench puts six LLM agents in a 90-day coffee supply chain, measuring net profit through B2B negotiation, orders, and inventory — not one-turn Q&A. GPT-5.5 and Claude Opus 4.7 act and profit; Claude Haiku 4.5 analyzes but stops acting, a long-horizon analysis paralysis seen in every trial.
Released June 26, 2026 with Azusa Audit, headed to ICML 2026's Failure Modes in Agentic AI workshop. Code, paper, and public trajectories are live. For teams betting on autonomous operations, CoffeeBench is a reminder: the next frontier metric is not pass@1 but whether the agent is still in business on day 90.
Last updated: June 26, 2026. Sources: Sakana AI CoffeeBench announcement, technical article, arXiv 2606.16613, GitHub.