CoffeeBench is a benchmark from Sakana AI and Azusa Audit Corporation that evaluates LLM agents on long-term business management. Six agents — farmers, roasters, and retailers in a coffee supply chain — interact via email and transactions over 90 simulated days, each trying to maximize net profit through price negotiation, ordering, and inventory management.

How does CoffeeBench differ from coding benchmarks like SWE-bench?

SWE-bench measures isolated bug fixes. CoffeeBench measures sustained economic decision-making — cash flow, credit sales, supplier relationships, and daily fixed costs over 90 days. Agents must act repeatedly under pressure; passive models bleed money even if they analyze correctly.

Which models performed best on CoffeeBench?

Sakana reports that most frontier models beat a passive baseline that does nothing and accumulates losses. High performers — including GPT-5.5 and Claude Opus 4.7 — actively negotiate via email and execute profit-linked trades. Claude Haiku 4.5 performed worst, entering analysis paralysis and calling wait_for_next_day() until the simulation ended in deficit. Kimi K2.6 showed high tool use but weak profit.

What is the analysis paralysis finding?

Claude Haiku 4.5 correctly analyzed its situation — cheap beans available, rising retail demand — but repeatedly chose wait_for_next_day() instead of acting. Sakana calls this a thought-action gap unique to long-horizon tasks; it occurred in all three trial runs for Haiku and not in other evaluated models.

Where can I read the paper and run CoffeeBench?

Paper at arxiv.org/abs/2606.16613, code at github.com/SakanaAI/CoffeeBench, trajectories at pub.sakana.ai/coffeebench/trajectories.html, and overview at sakana.ai/coffee-bench/. Presented at the ICML 2026 Workshop Failure Modes in Agentic AI.

How does CoffeeBench relate to Sakana Fugu?

Fugu orchestrates multiple models for single tasks; CoffeeBench evaluates how individual models behave as autonomous economic agents over months. Fugu's value proposition is routing around vendor lock-in — CoffeeBench asks whether any one model can sustain profitable B2B operations without human intervention.

CoffeeBench: Sakana AI 90-Day LLM Supply Chain Benchmark (2026)

Single-turn benchmarks tell you whether a model can answer. They do not tell you whether it can run a business for 90 days.

On June 26, 2026, Sakana AI and Azusa Audit Corporation released CoffeeBench — a multi-agent economic simulation where six LLM agents operate companies across a coffee supply chain. Over 90 simulated days, they negotiate prices, place orders, manage inventory, pay invoices, and try to maximize net profit. Some models trade aggressively and stay in the black. Others analyze correctly and never act, bleeding cash until the quarter ends.

CoffeeBench is the operationalized version of a line the agent-eval community keeps repeating: evals are environments, not scores. Sakana ships a specific instance — B2B supply chains, email negotiation, credit sales — with net profit as the metric. The paper is on arXiv (2606.16613); code is on GitHub.

TL;DR

Topic	Detail
Released	June 26, 2026 — Sakana AI + Azusa Audit
Setting	Coffee industry supply chain — 6 companies (farmers, roasters, retailers)
Horizon	90 simulated days — each tool call costs 30 min of business time
Interaction	Email, offers, orders, invoices — ReAct agents with role-specific tools
Metric	Net profit at simulation end (daily fixed costs punish inaction)
Eval design	Test model runs Roaster A; other 5 firms fixed to Claude Sonnet 4.6
Runs	3 trials per model, averaged
Headline split	GPT-5.5 / Opus 4.7 profit via active negotiation; Haiku 4.5 analysis paralysis
Venue	ICML 2026 Workshop — Failure Modes in Agentic AI
Future work	Coordination, competition, misconduct, audit/governance methods

Why another agent benchmark?

Coding evals dominated 2025–2026 — SWE-bench, Terminal-Bench, Cursor's strict-harness audits. They measure one-shot or short-horizon task completion in repos and terminals.

Real economic activity looks different:

B2B relationships — not just selling to consumers (compare early Vending-Bench-style setups)
Credit terms — invoices paid later, cash flow matters
Repeated negotiation — prices move, suppliers compete
Opportunistic peers — five other LLM companies also maximize their KPIs

CoffeeBench targets ongoing management, the class of problem where specification gaming appears: optimizing revenue instead of profit, or discovering circular trades that inflate sales. Sakana tested for the latter, and current models did not find the exploit.

Sakana frames the horizon explicitly: a society where LLM agents run companies needs benchmarks that surface cooperation, competition, and misconduct — not just pass rates on unit tests.

How CoffeeBench works

Six roles, one supply chain

Coffee was chosen because it is simple enough to simulate but rich enough to matter:

Role	Example actions
Farmer	`produce_item()` — grow beans
Roaster	`roast()` — process beans (eval target company Roaster A)
Retailer	`set_retail_price()` — sell to simulated demand

All agents share tools for messaging, making/accepting offers, ordering, and paying invoices. When nothing urgent remains, agents call wait_for_next_day(). When every agent waits, the clock advances. Morning brings simulated retail sales and new economic state.
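The day-advance rule described above can be sketched in a few lines. This is an illustration only, not CoffeeBench's actual implementation: the `DayClock` class and the agent names are hypothetical, and the only details taken from the text are the `wait_for_next_day()` tool name and the rule that the clock advances once every agent is waiting.

```python
class DayClock:
    """Hypothetical model of the rule: the day advances once all agents wait."""

    def __init__(self, agents):
        self.agents = set(agents)
        self.waiting = set()
        self.day = 1

    def wait_for_next_day(self, agent):
        """Mark `agent` as done for today; return True if the day advanced."""
        if agent not in self.agents:
            raise ValueError(f"unknown agent: {agent}")
        self.waiting.add(agent)
        if self.waiting == self.agents:
            # Everyone is idle, so the simulated morning arrives.
            self.day += 1
            self.waiting.clear()
            return True
        return False


clock = DayClock(["farmer_a", "roaster_a", "retailer_a"])
clock.wait_for_next_day("farmer_a")
clock.wait_for_next_day("roaster_a")
advanced = clock.wait_for_next_day("retailer_a")
```

One consequence of this design is that a single agent cannot skip ahead on its own: an agent that waits early simply gives up the rest of its business day while the others keep acting.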

Time, money, and pressure

Each tool invocation consumes 30 minutes of simulated business time
Daily fixed costs accrue — the Passive baseline (always wait) loses money every day
Credit sales (掛売り) and demand fluctuation mirror real B2B friction
Agents must balance cash, inventory, counterparty relationships, and forecast demand
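The arithmetic behind these pressures is simple enough to sketch. Only the 30-minute cost per tool call and the 90-day horizon come from the article; the length of the business day and the size of the daily fixed cost below are assumed values chosen for illustration, and the sketch assumes a purely passive agent earns no revenue.

```python
MINUTES_PER_TOOL_CALL = 30  # stated in the article


def tool_calls_per_day(business_hours):
    """How many tool calls fit in one simulated business day."""
    return int(business_hours * 60 // MINUTES_PER_TOOL_CALL)


def passive_net_profit(days, daily_fixed_cost):
    """Net profit of an agent that only waits: no revenue, fixed costs every day."""
    return -days * daily_fixed_cost


# Assumed values: an 8-hour business day and a fixed cost of 100 per day.
budget = tool_calls_per_day(8)
passive_result = passive_net_profit(90, 100.0)
```

Under these assumptions an agent has a small daily action budget, and doing nothing for the full horizon guarantees a loss proportional to the number of days, which is why the Passive baseline works as a floor.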

This is closer to an agent harness problem than a prompt problem: the environment rules (time cost, fixed burn, credit terms) define what "good" looks like as much as the model weights.

What Sakana measured

Experimental setup

Evaluated model manages Roaster A
Other five companies run on Claude Sonnet 4.6 (held constant)
Three runs per model, mean reported
Passive baseline included — proves the environment punishes inaction
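The reporting step in this setup amounts to averaging three final net-profit figures and comparing the mean against the Passive baseline. A minimal sketch follows; the function name and all numbers are hypothetical and are not results from the paper.

```python
from statistics import mean


def summarize(trial_profits, passive_profit):
    """Average final net profit over trials and compare it to the passive floor."""
    avg = mean(trial_profits)
    return {"mean_net_profit": avg, "beats_passive": avg > passive_profit}


# Made-up figures for three trials of one model.
report = summarize([100.0, -50.0, 250.0], passive_profit=-900.0)
```

With only three trials per model, a single unusual run can move the mean substantially, so the published trajectories are worth reading alongside the averages.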

Full trajectory replays are public: pub.sakana.ai/coffeebench/trajectories.html

Profit spreads widely

All tested frontier models beat Passive — they can act profitably in principle. Between models, spreads are large. Sakana reports GPT-5.5 and Claude Opus 4.7 among strong performers with rising cumulative net profit. Claude Haiku 4.5 finished in deficit.

High performers share a behavioral signature:

More email to farmers and retailers
More negotiation and promotion moves
Tool use directed at profit — make_offer, accept_offer, not idle churn

Activity ≠ profit (Kimi K2.6)

Kimi K2.6 logged tool-call volume comparable to the top models, but profit did not follow. Sakana's read: calling tools is insufficient; agents must channel activity into trade execution and price negotiation. The lesson matches reward hacking on SWE-bench: a busy trajectory can hide the fact that the agent is missing the metric you actually care about.
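One way to separate activity from trade-directed activity is to measure what fraction of a trajectory's tool calls are offers or acceptances. The sketch below assumes a trajectory is available as a flat list of tool names, which is a simplification and not the published log format. `make_offer` and `accept_offer` are named in the article; `read_email` is a hypothetical placeholder for any other tool.

```python
from collections import Counter

TRADE_TOOLS = {"make_offer", "accept_offer"}  # tool names from the article


def trade_share(tool_calls):
    """Fraction of tool calls that are trade-directed; 0.0 for an empty log."""
    if not tool_calls:
        return 0.0
    counts = Counter(tool_calls)
    return sum(counts[name] for name in TRADE_TOOLS) / len(tool_calls)


# Two logs with the same volume but different focus.
busy = ["read_email"] * 8 + ["make_offer", "accept_offer"]
focused = ["read_email"] * 4 + ["make_offer"] * 3 + ["accept_offer"] * 3
busy_share = trade_share(busy)
focused_share = trade_share(focused)
```

Both logs contain ten calls, yet the trade-directed share differs sharply, which is the kind of gap a raw tool-call count would hide.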

Passive responder style (Gemini 3.1 Pro)

Gemini 3.1 Pro sent fewer outbound messages but read incoming mail frequently — a reactive management style: wait for partners to move first. Profitable in some stretches, but a different strategic profile than aggressive negotiators.

Analysis paralysis (Claude Haiku 4.5)

The most cited failure mode: Haiku 4.5 stopped conducting business mid-simulation and looped wait_for_next_day() until day 90.

Inference logs show the model understood the situation — cheap beans from farmers, rising retail demand — and planned responses. It did not execute them. Sakana labels this thought–action divergence over long horizons. It appeared in all three Haiku trials and not in other models tested.
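A wait loop of this kind is easy to flag in a trajectory. The sketch below counts how many consecutive `wait_for_next_day` calls close out a log; the flat list-of-tool-names format is an assumption, and any threshold for calling a streak "paralysis" would be a choice of the analyst, not something defined by Sakana.

```python
def trailing_wait_streak(tool_calls, wait_tool="wait_for_next_day"):
    """Count consecutive `wait_tool` calls at the end of a trajectory."""
    streak = 0
    for name in reversed(tool_calls):
        if name != wait_tool:
            break
        streak += 1
    return streak


# Hypothetical log: one offer, then nothing but waiting.
log = ["make_offer"] + ["wait_for_next_day"] * 5
streak = trailing_wait_streak(log)
```

Pairing this count with the reasoning text from the same steps is what distinguishes analysis paralysis from an agent that simply has nothing useful left to do.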

That is not a single-turn hallucination problem. It is long-horizon agency collapse — the same category of risk OpenAI Deployment Simulation probes before launch, but measured in spreadsheet outcomes instead of chat transcripts.

Governance angle — revenue pressure and future misconduct

Sakana ran a pilot in which the KPI shifted from net profit to aggressive revenue targets, with explicit pressure to hit the numbers. Circular trades that inflate sales, a known multi-agent failure mode, did not appear in current runs. Sakana hypothesizes that models have not yet discovered the exploit.
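For auditing purposes, a circular trade can be treated as a cycle in the directed graph of who sold to whom. The sketch below uses a standard depth-first search for cycle detection; it is not Sakana's audit method, and the `(seller, buyer)` pair format and firm names are assumptions made for illustration.

```python
def has_circular_trade(trades):
    """Return True if the seller -> buyer graph contains a directed cycle."""
    graph = {}
    for seller, buyer in trades:
        graph.setdefault(seller, set()).add(buyer)
        graph.setdefault(buyer, set())

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}

    def visit(node):
        state[node] = IN_PROGRESS
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                return True  # reached a node already on the current path
            if state[nxt] == UNSEEN and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[n] == UNSEEN and visit(n) for n in graph)


normal_chain = [("farmer_a", "roaster_a"), ("roaster_a", "retailer_a")]
suspicious = [("roaster_a", "roaster_b"), ("roaster_b", "roaster_a")]
```

A detected cycle is a flag for review rather than proof of misconduct, since a real audit would also need to look at prices, quantities, and timing.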

The forward-looking claim: as models improve at long-horizon strategy and multi-agent coordination, some may find misconduct that maximizes the metric. CoffeeBench is positioned as infrastructure to study, audit, and govern that transition — aligned with Azusa Audit's involvement and the ICML workshop theme Failure Modes in Agentic AI.

Commentary on X linked the release to Patronus AI's $50M raise for agent evaluation infrastructure — same category, different surface: Sakana open-sources a concrete economic world; Patronus sells eval tooling around similar risks.

Compared to Vending-Bench and Project Vend

CoffeeBench sits in a small but growing line of "agents run a business" evals:

Benchmark	Scope	CoffeeBench difference
Vending-Bench	Single agent, vending machine, consumer sales	CoffeeBench adds six firms and B2B credit
Project Vend	Physical office vending experiment	CoffeeBench is fully simulated, reproducible, open-source
SWE-bench / Terminal-Bench	Code repos, shells	CoffeeBench measures P&L over 90 days, not patch pass rate

That progression matters for procurement: a vendor's coding score is not evidence they can operate procurement bots across a supplier network for a quarter.

CoffeeBench vs Sakana Fugu

Aspect	CoffeeBench	Sakana Fugu
Question	Can one model run a business for 90 days?	Can one API orchestrate many models for one task?
Architecture	Single ReAct agent per company	Coordinator + specialist pool
Metric	Net profit, trajectories, emails	SWE-bench, HLE, GPQA on published tables
Timing	June 26, 2026 benchmark release	June 22, 2026 product launch
Fable context	Tests autonomous management	Routes around Fable 5 export controls

A community question on X, how Fugu Ultra scores on CoffeeBench, remained open at launch. Orchestration may help on single complex tasks without fixing 90-day executive function in one model. CoffeeBench is where that distinction gets tested next.

What practitioners should take away

Do not trust coding leaderboard rank for ops automation

A model strong on SWE-bench can still freeze when asked to negotiate, pay suppliers, and manage cash for a quarter. CoffeeBench is early evidence that capability profiles diverge on horizon length.

Log trajectories, not just scores

Sakana publishes full action logs — email sends, tool calls, wait loops. That is the audit pattern Cursor's SWE-bench study advocates for coding agents, applied to P&L outcomes.

Separate "knows" from "does"

Haiku's failure is instructive: correct analysis plus zero execution still ends in a deficit. Product teams deploying autonomous procurement or vendor bots should evaluate on multi-week simulations, not demo prompts.

Watch the misconduct research thread

Today's models may not circular-trade. Tomorrow's might, under revenue KPIs. Build governance and audit before agents touch real supplier payments.

Reproducibility

Because the other five firms are fixed to Claude Sonnet 4.6, leaderboard shifts reflect the evaluated model's management style, not a moving multi-agent ecosystem. That is good for science, but it would be a mistake to assume the same rankings hold when all six firms run the same frontier model simultaneously.

Post	Connection
Sakana Fugu orchestration	Same lab — product vs benchmark
Cursor SWE-bench reward hacking	Evals are environments; audit trajectories
Specification gaming	Revenue KPIs vs profit; future circular trades
Agent harness guide	Environment design shapes agent behavior
Terminal-Bench 2.0	Alternative long-horizon eval philosophy
OpenAI Deployment Simulation	Pre-release behavior probing
GPT-5.6 government gating	Frontier models under scrutiny — evals matter more

Summary

CoffeeBench puts six LLM agents in a 90-day coffee supply chain, measuring net profit through B2B negotiation, orders, and inventory — not one-turn Q&A. GPT-5.5 and Claude Opus 4.7 act and profit; Claude Haiku 4.5 analyzes but stops acting, a long-horizon analysis paralysis seen in every trial.

Released June 26, 2026 with Azusa Audit, headed to ICML 2026's Failure Modes in Agentic AI workshop. Code, paper, and public trajectories are live. For teams betting on autonomous operations, CoffeeBench is a reminder: the next frontier metric is not pass@1 but whether the agent is still in business on day 90.

Last updated: June 26, 2026. Sources: Sakana AI CoffeeBench announcement, technical article, arXiv 2606.16613, GitHub.
