
CoffeeBench: Sakana AI Benchmarks 90-Day LLM Supply Chain Management

Sakana AI and Azusa Audit released CoffeeBench — a 90-day coffee supply chain where six LLM agents negotiate, order, and manage inventory for net profit. Models diverge sharply on long-horizon business tasks.

Jun 26, 2026 · 9 min read · Yash Thakker
Tags: Sakana AI, AI Benchmarks, Multi-Agent Systems, Agent Evaluation, LLM Agents

Single-turn benchmarks tell you whether a model can answer. They do not tell you whether it can run a business for 90 days.

On June 26, 2026, Sakana AI and Azusa Audit Corporation released CoffeeBench — a multi-agent economic simulation where six LLM agents operate companies across a coffee supply chain. Over 90 simulated days, they negotiate prices, place orders, manage inventory, pay invoices, and try to maximize net profit. Some models trade aggressively and stay in the black. Others analyze correctly and never act, bleeding cash until the quarter ends.

CoffeeBench is the operationalized version of a line the agent-eval community keeps repeating: evals are environments, not scores. Sakana ships a specific instance — B2B supply chains, email negotiation, credit sales — with net profit as the metric. The paper is on arXiv (2606.16613); code is on GitHub.



TL;DR

| Topic | Detail |
|---|---|
| Released | June 26, 2026, by Sakana AI + Azusa Audit |
| Setting | Coffee industry supply chain: 6 companies (farmers, roasters, retailers) |
| Horizon | 90 simulated days; each tool call costs 30 min of business time |
| Interaction | Email, offers, orders, invoices; ReAct agents with role-specific tools |
| Metric | Net profit at simulation end (daily fixed costs punish inaction) |
| Eval design | Test model runs Roaster A; other 5 firms fixed to Claude Sonnet 4.6 |
| Runs | 3 trials per model, averaged |
| Headline split | GPT-5.5 / Opus 4.7 profit via active negotiation; Haiku 4.5 analysis paralysis |
| Venue | ICML 2026 Workshop: Failure Modes in Agentic AI |
| Future work | Coordination, competition, misconduct, audit/governance methods |

Why another agent benchmark?

Coding evals dominated 2025–2026 — SWE-bench, Terminal-Bench, Cursor's strict-harness audits. They measure one-shot or short-horizon task completion in repos and terminals.

Real economic activity looks different:

  • B2B relationships — not just selling to consumers (compare early Vending-Bench-style setups)
  • Credit terms — invoices paid later, cash flow matters
  • Repeated negotiation — prices move, suppliers compete
  • Opportunistic peers — five other LLM companies also maximize their KPIs

CoffeeBench targets ongoing management. This is the class of problem where specification gaming shows up: when you optimize revenue instead of profit, or when agents discover circular trades to inflate sales. Sakana tested for the latter, and current models did not find the exploit, at least not yet.

Sakana states the long-term motivation explicitly: a society where LLM agents run companies needs benchmarks that surface cooperation, competition, and misconduct, not just pass rates on unit tests.


How CoffeeBench works

Six roles, one supply chain

Coffee was chosen because it is simple enough to simulate but rich enough to matter:

| Role | Example actions |
|---|---|
| Farmer | produce_item(): grow beans |
| Roaster | roast(): process beans (eval target company: Roaster A) |
| Retailer | set_retail_price(): sell to simulated demand |

All agents share tools for messaging, making/accepting offers, ordering, and paying invoices. When nothing urgent remains, agents call wait_for_next_day(). When every agent waits, the clock advances. Morning brings simulated retail sales and new economic state.
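The day-advance rule can be sketched in a few lines. This is a toy illustration, not Sakana's implementation: the agent names, the round-robin turn order, and the assumption that a waiting agent stays waiting until morning are all hypothetical. Only the tool names come from the description above.

```python
from collections import deque
from typing import Deque, Dict, List

WAIT = "wait_for_next_day"  # tool name from the article


def run_day(todo: Dict[str, Deque[str]]) -> List[str]:
    """Round-robin over agents until every agent has called WAIT.

    `todo` maps a hypothetical agent name to the actions it still wants
    to take today. An agent with nothing left to do calls WAIT.
    Assumption: once an agent waits, it stays waiting until the next day.
    """
    waiting = set()
    events: List[str] = []
    while len(waiting) < len(todo):
        for name, queue in todo.items():
            if name in waiting:
                continue
            action = queue.popleft() if queue else WAIT
            events.append(f"{name}:{action}")
            if action == WAIT:
                waiting.add(name)
    return events  # the caller would now advance the clock by one day


events = run_day({
    "roaster_a": deque(["make_offer", "accept_offer"]),
    "farmer_a": deque(["produce_item"]),
})
print(events)
```

In this toy run the farmer finishes first and waits, while the roaster keeps acting. The day only ends once the roaster also waits.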

Time, money, and pressure

  • Each tool invocation consumes 30 minutes of simulated business time
  • Daily fixed costs accrue — the Passive baseline (always wait) loses money every day
  • Credit sales (掛売り) and demand fluctuation mirror real B2B friction
  • Agents must balance cash, inventory, counterparty relationships, and forecast demand
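A back-of-the-envelope sketch shows why these rules punish inaction. The 30-minute tool cost and the 90-day horizon come from the description above. The 8-hour business day and the fixed-cost figure are placeholder assumptions, not values from the paper.

```python
MINUTES_PER_TOOL_CALL = 30       # from the article
HORIZON_DAYS = 90                # from the article
BUSINESS_DAY_MINUTES = 8 * 60    # assumption: an 8-hour business day
DAILY_FIXED_COST = 100.0         # assumption: placeholder currency units


def max_tool_calls_per_day() -> int:
    """Upper bound on tool calls if the day is a hard time budget (assumed)."""
    return BUSINESS_DAY_MINUTES // MINUTES_PER_TOOL_CALL


def passive_net_profit(days: int = HORIZON_DAYS) -> float:
    """Net profit of an agent that only waits: no revenue, fixed costs daily."""
    return -DAILY_FIXED_COST * days


print(max_tool_calls_per_day())  # 16
print(passive_net_profit())      # -9000.0
```

Under these assumed numbers, every tool call is a scarce resource and every idle day has a price, so the Passive baseline sets a floor that any useful agent has to clear.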

This is closer to an agent harness problem than a prompt problem: the environment rules (time cost, fixed burn, credit terms) define what "good" looks like as much as the model weights.


What Sakana measured

Experimental setup

  • Evaluated model manages Roaster A
  • Other five companies run on Claude Sonnet 4.6 (held constant)
  • Three runs per model, mean reported
  • Passive baseline included — proves the environment punishes inaction
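With only three runs per model, it helps to look at the spread alongside the mean. The sketch below uses made-up profit figures and hypothetical model labels. None of these numbers are results from the paper.

```python
from statistics import mean, stdev

# Hypothetical net-profit figures for three trials each; NOT paper results.
runs = {
    "model_x": [1200.0, 900.0, 1500.0],
    "passive": [-9000.0, -9000.0, -9000.0],
}

# Report the mean (as the benchmark does) plus the sample standard deviation.
summary = {name: (mean(values), stdev(values)) for name, values in runs.items()}

for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.1f} stdev={spread:.1f}")
```

A deterministic baseline like always-wait has zero spread, so any variance in the table comes from the evaluated model and its counterparties.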

Full trajectory replays are public: pub.sakana.ai/coffeebench/trajectories.html

Profit spreads widely

All tested frontier models beat Passive — they can act profitably in principle. Between models, spreads are large. Sakana reports GPT-5.5 and Claude Opus 4.7 among strong performers with rising cumulative net profit. Claude Haiku 4.5 finished in deficit.

High performers share a behavioral signature:

  • More email to farmers and retailers
  • More negotiation and promotion moves
  • Tool use directed at profit — make_offer, accept_offer, not idle churn

Activity ≠ profit (Kimi K2.6)

Kimi K2.6 logged tool-call volume comparable to top models, but profit did not follow. Sakana's read: calling tools is insufficient; agents must channel activity into trade execution and price negotiation. This is the same lesson as reward hacking on SWE-bench: a busy trajectory can hide the fact that the agent is missing the metric you actually care about.
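One way to separate volume from direction is to measure what share of tool calls actually moves a trade forward. This is an illustrative metric, not one Sakana reports. The trade tool names make_offer and accept_offer come from the article, while read_email and both sample trajectories are hypothetical.

```python
from collections import Counter
from typing import Iterable

TRADE_TOOLS = {"make_offer", "accept_offer"}  # tool names from the article


def trade_share(tool_calls: Iterable[str]) -> float:
    """Fraction of tool calls that are trade-directed (0.0 for an empty log)."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[tool] for tool in TRADE_TOOLS) / total


# Two hypothetical trajectories with the same volume (10 calls each).
busy = ["read_email"] * 8 + ["make_offer"] * 2
focused = ["read_email"] * 4 + ["make_offer"] * 3 + ["accept_offer"] * 3

print(trade_share(busy))     # 0.2
print(trade_share(focused))  # 0.6
```

Both logs look equally active by call count. Only the second spends most of its time on offers.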

Passive responder style (Gemini 3.1 Pro)

Gemini 3.1 Pro sent fewer outbound messages but read incoming mail frequently — a reactive management style: wait for partners to move first. Profitable in some stretches, but a different strategic profile than aggressive negotiators.

Analysis paralysis (Claude Haiku 4.5)

The most cited failure mode: Haiku 4.5 stopped conducting business mid-simulation and looped wait_for_next_day() until day 90.

Inference logs show the model understood the situation — cheap beans from farmers, rising retail demand — and planned responses. It did not execute them. Sakana labels this thought–action divergence over long horizons. It appeared in all three Haiku trials and not in other models tested.
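A wait loop like this is easy to flag from a trajectory log. The sketch below assumes a hypothetical log format (one list of tool names per simulated day) and finds the last day on which the agent did anything other than wait. The example trajectory is invented, not taken from the published replays.

```python
from typing import List, Optional

WAIT = "wait_for_next_day"  # tool name from the article


def last_active_day(actions_by_day: List[List[str]]) -> Optional[int]:
    """Return the 1-indexed last day with any non-wait action, or None."""
    for day in range(len(actions_by_day), 0, -1):
        if any(action != WAIT for action in actions_by_day[day - 1]):
            return day
    return None


# Hypothetical 90-day log: trading for 40 days, then only waiting.
trajectory = [["make_offer", WAIT]] * 40 + [[WAIT]] * 50

active_until = last_active_day(trajectory)
idle_tail = len(trajectory) - (active_until or 0)
print(active_until, idle_tail)  # 40 50
```

A long idle tail on its own does not explain why the agent stopped. It does tell you where in the log to start reading the model's reasoning.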

That is not a single-turn hallucination problem. It is long-horizon agency collapse — the same category of risk OpenAI Deployment Simulation probes before launch, but measured in spreadsheet outcomes instead of chat transcripts.


Governance angle — revenue pressure and future misconduct

Sakana ran a pilot in which the KPI shifted from net profit to aggressive revenue targets, with explicit pressure to hit the numbers. Circular trades to inflate sales, a known multi-agent failure mode, did not appear in current runs. Sakana hypothesizes that the models have not yet discovered the exploit.
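For auditing purposes, a circular trade is a cycle in the directed graph of who sold to whom. The sketch below is a generic depth-first cycle check, not Sakana's detection method, and the company names are hypothetical. A detected cycle is a flag for review, not proof of misconduct.

```python
from typing import Dict, List, Set, Tuple

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished


def has_trade_cycle(sales: List[Tuple[str, str]]) -> bool:
    """Return True if the (seller, buyer) edges contain a directed cycle."""
    graph: Dict[str, Set[str]] = {}
    for seller, buyer in sales:
        graph.setdefault(seller, set()).add(buyer)
        graph.setdefault(buyer, set())
    color = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: goods return to the path
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


normal = [("farmer_a", "roaster_a"), ("roaster_a", "retailer_a")]
circular = normal + [("roaster_a", "roaster_b"), ("roaster_b", "roaster_a")]

print(has_trade_cycle(normal))    # False
print(has_trade_cycle(circular))  # True
```

A real audit would also weigh quantities, prices, and timing, since a cycle can come from a legitimate return of goods.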

The forward-looking claim: as models improve at long-horizon strategy and multi-agent coordination, some may find misconduct that maximizes the metric. CoffeeBench is positioned as infrastructure to study, audit, and govern that transition — aligned with Azusa Audit's involvement and the ICML workshop theme Failure Modes in Agentic AI.

Commentary on X linked the release to Patronus AI's $50M raise for agent evaluation infrastructure — same category, different surface: Sakana open-sources a concrete economic world; Patronus sells eval tooling around similar risks.

Compared to Vending-Bench and Project Vend

CoffeeBench sits in a small but growing line of "agents run a business" evals:

| Benchmark | Scope | CoffeeBench difference |
|---|---|---|
| Vending-Bench | Single agent, vending machine, consumer sales | CoffeeBench adds six firms and B2B credit |
| Project Vend | Physical office vending experiment | CoffeeBench is fully simulated, reproducible, open-source |
| SWE-bench / Terminal-Bench | Code repos, shells | CoffeeBench measures P&L over 90 days, not patch pass rate |

That progression matters for procurement: a vendor's coding score is not evidence they can operate procurement bots across a supplier network for a quarter.


CoffeeBench vs Sakana Fugu

| | CoffeeBench | Sakana Fugu |
|---|---|---|
| Question | Can one model run a business for 90 days? | Can one API orchestrate many models for one task? |
| Architecture | Single ReAct agent per company | Coordinator + specialist pool |
| Metric | Net profit, trajectories, emails | SWE-bench, HLE, GPQA on published tables |
| Timing | June 26, 2026 benchmark release | June 22, 2026 product launch |
| Fable context | Tests autonomous management | Routes around Fable 5 export controls |

A community question on X, how Fugu Ultra scores on CoffeeBench, remained open at launch. Orchestration may help on single complex tasks without fixing 90-day executive function in one model. CoffeeBench is where that distinction gets tested next.


What practitioners should take away

Do not trust coding leaderboard rank for ops automation

A model strong on SWE-bench can still freeze when asked to negotiate, pay suppliers, and manage cash for a quarter. CoffeeBench is early evidence that capability profiles diverge on horizon length.

Log trajectories, not just scores

Sakana publishes full action logs — email sends, tool calls, wait loops. That is the audit pattern Cursor's SWE-bench study advocates for coding agents, applied to P&L outcomes.

Separate "knows" from "does"

Haiku's failure is instructive: correct analysis + zero execution = a quarter in deficit. Product teams deploying autonomous procurement or vendor bots should evaluate on multi-week simulations, not demo prompts.

Watch the misconduct research thread

Today's models may not circular-trade. Tomorrow's might, under revenue KPIs. Build governance and audit before agents touch real supplier payments.

Reproducibility

Because the other five firms are fixed to Claude Sonnet 4.6, leaderboard shifts reflect the evaluated model's management style, not a moving multi-agent ecosystem. That is good for science, but the rankings may not hold when all six firms run the same frontier model simultaneously.


Related reading

| Post | Connection |
|---|---|
| Sakana Fugu orchestration | Same lab: product vs benchmark |
| Cursor SWE-bench reward hacking | Evals are environments; audit trajectories |
| Specification gaming | Revenue KPIs vs profit; future circular trades |
| Agent harness guide | Environment design shapes agent behavior |
| Terminal-Bench 2.0 | Alternative long-horizon eval philosophy |
| OpenAI Deployment Simulation | Pre-release behavior probing |
| GPT-5.6 government gating | Frontier models under scrutiny; evals matter more |

Summary

CoffeeBench puts six LLM agents in a 90-day coffee supply chain, measuring net profit through B2B negotiation, orders, and inventory — not one-turn Q&A. GPT-5.5 and Claude Opus 4.7 act and profit; Claude Haiku 4.5 analyzes but stops acting, a long-horizon analysis paralysis seen in every trial.

Released June 26, 2026 with Azusa Audit, headed to ICML 2026's Failure Modes in Agentic AI workshop. Code, paper, and public trajectories are live. For teams betting on autonomous operations, CoffeeBench is a reminder: the next frontier metric is not pass@1; it is whether the agent is still in business on day 90.


Last updated: June 26, 2026. Sources: Sakana AI CoffeeBench announcement, technical article, arXiv 2606.16613, GitHub.
