Every spring, the AI benchmarking landscape shifts. What was once challenging becomes saturated. What differentiated frontier models becomes statistical noise. But in May 2025, the Laude Institute, Stanford University, and Snorkel AI released something different: Terminal-Bench 1.0—a benchmark that became, in the words of the creators, a "runaway success," adopted by virtually every frontier lab within months.
Six months later, in November 2025, the team released Terminal-Bench 2.0. This wasn't just an incremental update. It was a proactive response to saturation—raising the bar before models conquered version 1.0, while fixing quality issues the community discovered through intensive usage. The result: 89 carefully curated tasks where frontier models still score below 65%, each task receiving approximately 3 reviewer-hours of human auditing to ensure it's solvable, realistic, and well-specified.
This post does four things: it explains what Terminal-Bench 2.0 actually measures, how it differs from other agent benchmarks like SWE-bench and GAIA, the Harbor framework that powers it, and what the current leaderboard tells us about the state of AI agents in 2026.
What Terminal-Bench 2.0 Actually Measures
The Core Thesis
Terminal-Bench 2.0 is a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Unlike academic benchmarks that test knowledge recall or isolated code generation, Terminal-Bench measures:
- Operational reliability in live tool-driven environments
- Multi-step planning and execution across complex workflows
- Recovery capabilities when errors occur
- Real-world task completion ranging from compiling code to training models and setting up servers
- Tool use ability to operate a computer via terminal autonomously
Each task features a unique environment, a human-written solution, and comprehensive tests for verification. Tasks must be completed using only Bash commands through a headless terminal—no GUI, no shortcuts, no structured output templates to lean on.
Task Structure and Evaluation
Every Terminal-Bench 2.0 task consists of:
- Containerized environment initialized with relevant packages and files (all dependencies pinned for reproducibility)
- Natural language instruction describing the task to complete
- Comprehensive pytest test suite to verify completion
- Human-written reference solution manually created to solve the task
Scoring is binary and strict: Pass@1 only. Models must pass ALL pytest validation tests to receive credit for a task. A task with 10 tests where 9 pass still scores 0. There are no multiple attempts, no partial credit, no second chances.
Verification Process
The evaluation pipeline is deterministic and transparent:
- Install
uvpackage manager - Use
uvxto install pytest and task-specific dependencies with pinned versions - Run pytest with specific formatting flags
- Generate two outputs:
- Detailed test report (Common Test Reporting Format via pytest-json-ctrf plugin)
- Single binary success file (1 or 0)
This approach eliminates the ambiguity and gaming vulnerabilities that plague LLM-judged benchmarks.
The Task Taxonomy: What Agents Actually Face
Terminal-Bench 2.0 covers diverse domains that reflect real developer and system administrator workflows:
Software Engineering
- Build systems and compilation: Navigate makefiles, dependency chains, compiler flags
- Dependency resolution: Install correct package versions, resolve conflicts
- Git operations with merge conflicts: Real repository manipulation beyond simple commits
- COBOL-to-Python rewrites: Legacy code migration requiring language understanding
- Code coverage analysis with gcov: Development tooling proficiency
Security & Cryptography
- Differential cryptanalysis on cipher systems: Advanced security knowledge
- Password recovery: Security testing and cracking techniques
- Vulnerability identification: Code auditing for security flaws
- API key removal from codebases: Security hygiene and scanning
Machine Learning & Data Science
- Training fastText models on Yelp data with accuracy/size constraints: Real ML pipelines with competing objectives
- Neural network framework integration: Model deployment and optimization
- Model optimization: Balancing accuracy, size, and performance
System Administration
- Server setup and configuration: Production-like infrastructure tasks
- Building and running Linux from source code: Deep systems knowledge
Domain-Specific Tasks
- Biology/computational tasks requiring specialized domain knowledge
- Chess engine move optimization: Algorithm implementation and testing
- Physics-based rendering: Scientific computing
- Video processing: Multimedia manipulation
- Personal assistant tasks: Real-world automation workflows
Difficulty Scaling
Tasks range from easy to hard:
- Easy tasks: ~65% average accuracy across models
- Hard tasks: ~16% average accuracy across models
These difficulty labels are author-estimated for humans and may not reflect agent difficulty—some "easy" tasks trip up frontier models while some "hard" tasks fall to clever tool use.
Version 2.0: What Changed and Why
The Quality Problem in Version 1.0
Terminal-Bench 1.0 launched in May 2025 with 80 tasks and became an instant success. But success brought scrutiny. The community—including the researchers themselves—discovered problems:
- Several tasks were unsolvable for artificial reasons (configuration issues, environment problems)
- Some set arbitrary thresholds that didn't reflect task completion
- Others lacked robustness—solutions that worked one day failed the next (the "download-youtube" task became notorious for breaking with YouTube's constantly changing anti-bot protections)
As frontier models climbed over 50% success rate on version 1.0, the benchmark risked becoming another saturated metric.
The 2.0 Response: Proactive Difficulty Scaling
Instead of waiting for saturation, the team raised the bar in November 2025:
Task Quality and Verification
- 89 tasks (up from 80) with substantial manual and LM-assisted verification
- Each task received approximately 3 reviewer-hours of human auditing
- Tasks verified to be: (1) solvable, (2) realistic, and (3) well-specified
- Problematic tasks eliminated entirely
Increased Difficulty
- Version 1.0: Frontier models climbed over 50%
- Version 2.0: Frontier models score less than 65%
- The benchmark now better represents frontier challenges that distinguish truly capable agents
Technical Infrastructure Upgrade
-
Version 1.0:
- Used structured outputs to enforce response schema
- Run on EC2 instances using Docker containers locally
-
Version 2.0:
- Does NOT use structured outputs (invalid/missing JSON retried with warning)
- Run remotely using Daytona managed containers for better scalability
- Uses new Harbor framework (complete rewrite for improved reliability, observability, scalability, and performance)
Specific Improvements
- Eliminated environment-dependent failures (like the YouTube anti-bot problem)
- Better separation of task specification from implementation details
- Improved test coverage and edge case handling
- Clearer documentation for each task
The Harbor Framework: How Terminal-Bench Actually Runs
Why Harbor Exists
The original Terminal-Bench 1.0 harness worked, but scaling to thousands of evaluations across 16 frontier models and 6 state-of-the-art agents revealed bottlenecks:
- Local execution on EC2 instances maxed out at 4-6 containers before hitting CPU, memory, and I/O constraints
- Manual orchestration slowed iteration cycles
- Limited observability made debugging agent failures difficult
- No standardized interface for testing custom agents
Harbor is the answer: an open-source framework for evaluating and optimizing agents in container environments.
Harbor Architecture
Core Components:
- Harbor task format: Standardized specification for agent tasks
- Harbor harness: Execution engine for running agents against tasks
- Harbor registry: Centralized task distribution (
harbor run -d terminal-bench@2.0) - Daytona integration: Managed runtime for scalable, reproducible sandboxes
Supported Agents:
- Claude Code
- Codex CLI
- OpenHands
- Mini-SWE-Agent
- Terminus 2 (simple neutral scaffold for comparing raw model performance)
Custom Agent Support: Developers can create custom agents by subclassing BaseInstalledAgent or BaseAgent, receiving the instruction and Docker container, then exploring/manipulating the environment through tool calls (editing files, running Bash commands).
Daytona: From Local to Cloud
The Evolution:
- Initial approach: Docker sandboxes on local machines (4-6 containers max)
- Problem: CPU, memory, and I/O constraints made it impractical for thousands of tests
- Solution: Daytona managed runtime
- Provisions long-running, reproducible sandboxes at scale
- Supports thousands of parallel experiments simultaneously
- For tasks with unique specs, teams submit Dockerfiles to Daytona's Declarative Image Builder
- Containers built and executed on demand
This infrastructure enabled the 32,155 total trials across models and agents that powered the 2.0 leaderboard.
Security and Contamination Prevention
Sandboxing: Each task runs in isolated Docker containers to prevent cross-contamination and ensure reproducibility.
Contamination Detection: Includes Big-Bench canary string in each repository file to aid in training corpus decontamination. The team acknowledges that private test sets are considered out of scope due to the community investment required, but the canary string provides transparency for model developers.
The 2026 Leaderboard: What Models Can (and Can't) Do
Direct Model Performance
As of 2026, the Terminal-Bench 2.0 leaderboard shows:
Top Direct Model Scores:
- GPT-5.5: 73.20% (leading)
- Claude Opus 4.7: 68.54%
- Gemini 3.1 Pro Preview: 67.42%
- GPT-5.3 Codex: 64.05%
- Claude Sonnet 4.6 / Muse Spark: 59.55% (tied)
The 65% Ceiling: Despite frontier models approaching human-level performance on many academic benchmarks (MMMU-Pro models within 0.3 points of human experts), they still fail on 18-35% of Terminal-Bench 2.0 tasks—tasks that experienced developers complete routinely.
Agent + Model Combinations
Top Agent Scores (combining agent scaffolding with frontier models):
- ForgeCode + Claude Opus 4.6: 81.8% (top score)
- ForgeCode + GPT-5.4: 81.8% (tied for top)
- TongAgents + Gemini 3.1 Pro: 80.2%
- SageAgent + GPT-5.3-Codex: 78.4%
- ForgeCode + Gemini 3.1 Pro: 78.4%
The Agent Scaffolding Effect: The same model can perform very differently with different agent implementations. For example, Gemini 2.5 Pro's pass rate improved 17% with Terminus 2 scaffolding over OpenHands—demonstrating that agent design matters significantly.
Evaluated Agents
The benchmark has tested 6 state-of-the-art agents across 16 frontier models with 32,155 total trials:
- Terminus 2: Simple neutral scaffold for comparing model performance
- Claude Code: Anthropic's agent implementation
- Codex CLI: OpenAI's command-line agent
- OpenHands: Open-source agentic framework
- Mini-SWE-Agent: Lightweight software engineering agent
- ForgeCode, TongAgents, SageAgent: Specialized agent frameworks
Performance by Difficulty
- Easy tasks: ~65% average accuracy across frontier models
- Hard tasks: ~16% average accuracy across frontier models
The 49-point gap between easy and hard tasks affects all models uniformly, suggesting that current agent architectures struggle with similar bottlenecks regardless of underlying model capability.
Terminal-Bench vs. Other Agent Benchmarks
Terminal-Bench vs. SWE-bench
SWE-bench (Verified: 500 tasks, Pro: 731 tasks):
- Focus: GitHub issue resolution and software patch correctness
- Task: Receive issue description + repo snapshot → produce patch that passes test suite
- Scoring: Patch must pass issue's associated test suite
- Domain: Software engineering only
- 2026 Leaders: Claude (77.2%), GPT-5 (74.9%)
Terminal-Bench (89 tasks):
- Focus: Operational reliability across diverse terminal tasks
- Task: Natural language instruction → complete task using Bash commands
- Scoring: Must pass comprehensive pytest validation
- Domain: Cross-domain (ML, security, system admin, data science, etc.)
- 2026 Leaders: GPT-5.5 (73.20% direct), ForgeCode combos (81.8%)
Key Difference: SWE-bench measures software-engineering proficiency; Terminal-Bench measures broader operational capabilities and tool-use accuracy. A model can excel at one and struggle at the other—they test distinct skill sets.
Terminal-Bench vs. GAIA
GAIA (466 tasks):
- Focus: Multi-step reasoning with diverse tasks
- Task: Questions requiring multi-step reasoning (e.g., "Find country's GDP and convert currency")
- Scoring: Answers scored against human-annotated ground truth + LLM judge for paraphrasing
- Nature: Question-answering
- 2026 Leaders: Claude Mythos Preview (52.3%), GPT-5.4 Pro (50.5%)
Terminal-Bench:
- Focus: Action-taking in terminal environments
- Task: End-to-end task completion via command execution
- Scoring: Deterministic based on exit codes, file diffs, output strings
- Nature: Task execution
Key Difference: GAIA tests general-assistant reasoning capability; Terminal-Bench tests action execution in live environments. GAIA asks "Can you figure out the answer?"; Terminal-Bench asks "Can you actually make it happen?"
Performance Variability Across Benchmarks
Models show different strengths across benchmarks. In the same week, a model might achieve:
- 87% on SWE-bench Verified (software engineering)
- 44% on GAIA (general reasoning)
- 73% on Terminal-Bench (operational tasks)
This demonstrates that software-engineering proficiency ≠ general-assistant capability ≠ operational reliability. Each benchmark captures distinct aspects of agent capability.
The Benchmark Gaming Problem
Important Note: Research from Berkeley RDI has shown that Terminal-Bench, SWE-bench, and GAIA (and other prominent agent benchmarks) can all be exploited to achieve near-perfect scores without solving tasks:
- OSWorld: VLM scoring can be manipulated by screenshot interpretation
- Terminal-Bench: Protected files can sometimes be accessed before sandboxing fully activates
- WebArena: Reference answers in local JSON files accessible to agents
The Terminal-Bench team is aware of these vulnerabilities and continues to improve sandboxing and verification mechanisms, but this highlights an ongoing challenge in creating truly robust evaluation benchmarks.
Why Terminal-Bench 2.0 Became the Industry Standard
Addressing a Critical Gap
As the announcement post states: "Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models."
Terminal-Bench 2.0 fills this gap by:
- Real-world workflow inspiration: Tasks come from actual developer and sysadmin work, not contrived academic problems
- Meaningful difficulty: Frontier models still score below 65%, leaving room to measure progress
- Deterministic evaluation: No LLM judges, no subjective scoring—just tests that pass or fail
- Reproducible infrastructure: Containerization ensures consistent execution across environments
Rapid Industry Adoption
Terminal-Bench 1.0 was a "runaway success"—since its May 2025 launch:
- Used by virtually every frontier AI lab
- Became the de facto standard for agent evaluation
- Cited in model release announcements and research papers
- Integrated into agent development workflows
Version 2.0 continues this trajectory while addressing quality concerns and raising difficulty before saturation set in.
Quality Commitment
Terminal-Bench 2.0 represents a commitment to maintaining the highest quality evaluation infrastructure as AI agent capabilities increase:
- ~3 reviewer-hours per task ensures tasks are solvable, realistic, well-specified
- Continuous improvement based on community feedback
- Transparent methodology with open-source Harbor framework
- Proactive difficulty scaling to prevent saturation
Differentiation Through Scope
While SWE-bench measures software patch correctness and GAIA measures multi-step reasoning with ground-truth answers, Terminal-Bench uniquely evaluates:
- Operational reliability across diverse terminal tasks
- Multi-step workflow planning and recovery
- Tool use in live environments
- Cross-domain capabilities (not just software engineering)
This broader scope makes Terminal-Bench particularly valuable for teams building general-purpose agents rather than domain-specific tools.
What Terminal-Bench 2.0 Tells Us About AI Agents in 2026
The Capability Ceiling
Frontier models still fail on 18-35% of tasks that humans complete routinely. This is not a transient gap—it reflects fundamental limitations in current agent architectures:
- Planning under uncertainty: Agents struggle when the next step depends on unknown output
- Error recovery: When a command fails, agents often retry the same approach or give up
- Domain knowledge transfer: Expertise in one domain (e.g., Python) doesn't automatically transfer to another (e.g., systems administration)
- Multi-step reasoning: Chains of 5+ dependent steps show exponential failure rates
The Agent Design Multiplier
Resolution rate correlates strongly with model capability AND agent orchestration. The 17% improvement Gemini 2.5 Pro saw with better scaffolding demonstrates that:
- Agent frameworks matter as much as underlying models
- Prompt engineering, tool design, and error handling are not solved problems
- There's no single "best" agent architecture—different designs excel at different task types
The Real-World Relevance Question
Terminal-Bench 2.0's tasks are inspired by real workflows, but are they representative of what actually matters in production?
Arguments for relevance:
- Tasks come from actual developer and sysadmin work
- Cover diverse domains encountered in practice
- Test end-to-end completion, not isolated skills
Arguments for caution:
- 89 tasks cannot capture all operational scenarios
- Containerized environments differ from production messiness
- Pass/fail scoring misses "good enough" trade-offs
The 37% gap between lab benchmark scores and real-world deployment performance observed across enterprise agentic AI systems suggests that even Terminal-Bench—among the best available benchmarks—cannot fully predict production readiness.
The Saturation Timeline
Version 1.0 saw frontier models climb from 20% (2025) to >50% in six months. Version 2.0 reset the bar, but models are already approaching 73%. If progress continues at this pace:
- 65% threshold crossed: Already achieved by top models
- 75% threshold likely by: Mid-2026
- 85% threshold likely by: Late 2026 to early 2027
- Saturation (>90%): Potentially mid-2027
The team will likely need Terminal-Bench 3.0 within 12-18 months to maintain differentiation at the frontier.
How to Use Terminal-Bench 2.0 in Your Workflow
Running the Benchmark
Official Distribution via Harbor:
harbor run -d terminal-bench@2.0
This command:
- Pulls the Terminal-Bench 2.0 task registry
- Provisions Daytona containers for each task
- Runs your agent against all 89 tasks
- Generates detailed CTRF reports and binary pass/fail results
Building Custom Agents
Minimum Implementation:
from harbor import BaseInstalledAgent
class MyAgent(BaseInstalledAgent):
def solve_task(self, instruction: str, container):
# Your agent logic here
# Use container.run_bash(command) to execute commands
# Parse outputs and decide next steps
# Return when task complete
pass
Provided Examples:
- Claude Code implementation as reference
- Terminus 2 for simple scaffold
- Documentation at github.com/laude-institute/harbor
Interpreting Results
What a 65% score means:
- Agent successfully completes 58 of 89 tasks
- 31 tasks either fail tests or cannot be completed
- Likely struggles with: hard tasks (16% avg), multi-step error recovery, domain-specific knowledge
What to optimize:
- Tool-use reliability: Does your agent correctly parse command outputs?
- Error recovery: What happens when a command fails?
- Planning depth: Can your agent handle 5+ dependent steps?
- Domain coverage: Does performance cluster in certain task categories?
The Benchmark Ecosystem: Where Terminal-Bench Fits
Complementary Benchmarks
For a complete picture of agent capability, use Terminal-Bench alongside:
- SWE-bench Pro: Software engineering proficiency
- GAIA: General-assistant reasoning
- OSWorld: GUI-based computer use
- LiveCodeBench: Contamination-resistant coding
Each benchmark tests distinct capabilities. High performance on one does not guarantee high performance on others.
When to Use Terminal-Bench
Use Terminal-Bench when:
- Evaluating command-line proficiency and operational tasks
- Testing multi-step workflow execution
- Measuring real-world task completion beyond isolated functions
- Comparing agent scaffolding designs with controlled model variables
Don't rely solely on Terminal-Bench when:
- Your use case is GUI-heavy (use OSWorld)
- You need domain-specific evaluation (use specialized benchmarks like Vals Finance Agent)
- Production involves long-term collaboration with humans (no benchmark captures this well)
The 37% Lab-to-Production Gap
Research shows that enterprise agentic AI systems exhibit a 37% gap between lab benchmark scores and real-world deployment performance. Terminal-Bench is among the best proxies for production readiness, but:
- Benchmarks test isolated task completion
- Production involves messy context, changing requirements, human collaboration
- Error detectability and correction ease matter as much as success rate
Recommendation: Use Terminal-Bench for relative comparisons between models and agents, but always validate in your specific production context before deployment.
The Future of Terminal-Bench
Likely Evolution
Based on the 1.0 → 2.0 transition, expect:
- Terminal-Bench 3.0 within 12-18 months as models approach 85-90%
- Harder tasks requiring deeper domain expertise
- Interactive tasks where environment state changes dynamically
- Adversarial tasks designed to expose specific failure modes
The Benchmark Treadmill
Terminal-Bench faces the same challenge as all benchmarks: saturation is inevitable. The question is not "if" but "when" and "how to respond."
Two strategies:
- Proactive difficulty scaling (the 1.0 → 2.0 approach): Raise the bar before saturation
- Dynamic task generation: Continuously generate new tasks to create a moving target (like LiveCodeBench for coding)
Terminal-Bench currently uses strategy #1. A shift to strategy #2 might be necessary to maintain relevance beyond 2027.
Community Contribution
Terminal-Bench 2.0 selected 89 tasks from 229 submissions by 93 contributors through crowdsourcing. This community-driven approach:
- Brings diverse domain expertise
- Scales task creation beyond core team bandwidth
- Reflects real-world task diversity
Future versions will likely lean more heavily on community contributions, potentially with:
- Submission portal for new tasks
- Peer review process for quality control
- Bounties for particularly challenging tasks
Bottom Line
Terminal-Bench 2.0 has earned its place as the de facto standard for AI agent evaluation because it tests what actually matters: Can agents complete real-world operational tasks reliably, recover from errors, and execute multi-step workflows across diverse domains?
The answer in 2026 is: mostly, but not entirely. Frontier models score 65-73% direct, with agent scaffolding pushing top combinations to 81-82%. This leaves a meaningful capability gap that differentiates systems and reveals failure modes invisible to saturated academic benchmarks.
For teams building agents, Terminal-Bench 2.0 provides:
- Relative comparison data to choose models and architectures
- Failure mode visibility to guide optimization
- Reproducible evaluation infrastructure via Harbor and Daytona
- Industry-standard reporting for stakeholder communication
But remember: Terminal-Bench scores are proxies, not guarantees. The 37% lab-to-production gap means you still need to validate in your specific context before trusting an agent with production work.
Read the official resources:
- Terminal-Bench Official Site
- Terminal-Bench 2.0 Announcement
- ArXiv Paper: Terminal-Bench (2601.11868)
- Harbor Framework GitHub
- Leaderboard
For complementary perspectives on agent evaluation, benchmarking, and production deployment, see our other guides:
- What Are Agent Skills: Complete Guide
- Stanford's AI Index 2026: Takeaways
- What is MCP (Model Context Protocol)
Disclosure: This post is editorial commentary on public materials from the Laude Institute, Stanford University, Snorkel AI, and the Terminal-Bench community. For academic or production citations, use the primary paper and official leaderboard data.