
Terminal-Bench 2.0: The AI Agent Benchmark That Actually Matters

Terminal-Bench 2.0 is the industry-standard benchmark for evaluating AI agents on real-world terminal tasks. 89 carefully curated tasks, Harbor framework, and results from GPT-5.5, Claude Opus 4.7, and more.

ExplainX Team · 18 min read
Tags: AI Benchmarks, AI Agents, Terminal-Bench, Harbor Framework, Agent Evaluation, AI Testing, Software Engineering, Research


Every spring, the AI benchmarking landscape shifts. What was once challenging becomes saturated. What differentiated frontier models becomes statistical noise. But in May 2025, the Laude Institute, Stanford University, and Snorkel AI released something different: Terminal-Bench 1.0—a benchmark that became, in the words of the creators, a "runaway success," adopted by virtually every frontier lab within months.

Six months later, in November 2025, the team released Terminal-Bench 2.0. This wasn't just an incremental update. It was a proactive response to saturation—raising the bar before models conquered version 1.0, while fixing quality issues the community discovered through intensive usage. The result: 89 carefully curated tasks where frontier models still score below 65%, each task receiving approximately 3 reviewer-hours of human auditing to ensure it's solvable, realistic, and well-specified.

This post does four things: it explains what Terminal-Bench 2.0 actually measures, how it differs from other agent benchmarks like SWE-bench and GAIA, how the Harbor framework that powers it works, and what the current leaderboard tells us about the state of AI agents in 2026.


What Terminal-Bench 2.0 Actually Measures

The Core Thesis

Terminal-Bench 2.0 is a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Unlike academic benchmarks that test knowledge recall or isolated code generation, Terminal-Bench measures:

  • Operational reliability in live tool-driven environments
  • Multi-step planning and execution across complex workflows
  • Recovery capabilities when errors occur
  • Real-world task completion ranging from compiling code to training models and setting up servers
  • Tool use ability to operate a computer via terminal autonomously

Each task features a unique environment, a human-written solution, and comprehensive tests for verification. Tasks must be completed using only Bash commands through a headless terminal—no GUI, no shortcuts, no structured output templates to lean on.

Task Structure and Evaluation

Every Terminal-Bench 2.0 task consists of:

  1. Containerized environment initialized with relevant packages and files (all dependencies pinned for reproducibility)
  2. Natural language instruction describing the task to complete
  3. Comprehensive pytest test suite to verify completion
  4. Human-written reference solution manually created to solve the task

Scoring is binary and strict: Pass@1 only. Models must pass ALL pytest validation tests to receive credit for a task. A task with 10 tests where 9 pass still scores 0. There are no multiple attempts, no partial credit, no second chances.
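
To make the all-or-nothing scoring concrete, below is a minimal sketch of what a verification suite might look like for a hypothetical task that asks an agent to produce a CSV report. The file path, column names, and row threshold are invented for illustration and are not taken from the benchmark; real Terminal-Bench suites are task-specific and more thorough, but the pattern is the same: every assertion must pass or the task scores 0.

# test_outputs.py -- hypothetical verification tests for an invented example task
# (illustrative only; not taken from the benchmark)
from pathlib import Path

REPORT = Path("/app/report.csv")  # assumed output location named in the instruction

def test_report_exists():
    assert REPORT.exists(), "agent never produced the report"

def test_header_matches_spec():
    header = REPORT.read_text().splitlines()[0]
    assert header == "user_id,total_spend,last_seen"

def test_report_has_enough_rows():
    data_rows = REPORT.read_text().strip().splitlines()[1:]
    assert len(data_rows) >= 100  # the hypothetical instruction demanded 100+ users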

Verification Process

The evaluation pipeline is deterministic and transparent:

  1. Install uv package manager
  2. Use uvx to install pytest and task-specific dependencies with pinned versions
  3. Run pytest with specific formatting flags
  4. Generate two outputs:
    • Detailed test report (Common Test Reporting Format via pytest-json-ctrf plugin)
    • Single binary success file (1 or 0)

This approach eliminates the ambiguity and gaming vulnerabilities that plague LLM-judged benchmarks.
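
As a rough sketch, the flow above can be mirrored in a few lines of Python. This is not the official harness; the uvx invocation and the pytest-json-ctrf flag spelling are assumptions based on the steps listed, so treat it as illustrative only.

# verify.py -- illustrative sketch of the verification flow described above
# (command and flag names are assumptions, not the official harness)
import subprocess
from pathlib import Path

def run_verification(task_dir: str) -> int:
    # Steps 2-3: run the task's pytest suite through uvx with the CTRF report plugin
    result = subprocess.run(
        ["uvx", "--with", "pytest-json-ctrf", "pytest", "--ctrf", "report.json", "tests/"],
        cwd=task_dir,
    )
    # Step 4: collapse the detailed report into a single binary success signal
    success = 1 if result.returncode == 0 else 0
    Path(task_dir, "success.txt").write_text(str(success))
    return success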


The Task Taxonomy: What Agents Actually Face

Terminal-Bench 2.0 covers diverse domains that reflect real developer and system administrator workflows:

Software Engineering

  • Build systems and compilation: Navigate makefiles, dependency chains, compiler flags
  • Dependency resolution: Install correct package versions, resolve conflicts
  • Git operations with merge conflicts: Real repository manipulation beyond simple commits
  • COBOL-to-Python rewrites: Legacy code migration requiring language understanding
  • Code coverage analysis with gcov: Development tooling proficiency

Security & Cryptography

  • Differential cryptanalysis on cipher systems: Advanced security knowledge
  • Password recovery: Security testing and cracking techniques
  • Vulnerability identification: Code auditing for security flaws
  • API key removal from codebases: Security hygiene and scanning

Machine Learning & Data Science

  • Training fastText models on Yelp data with accuracy/size constraints: Real ML pipelines with competing objectives
  • Neural network framework integration: Model deployment and optimization
  • Model optimization: Balancing accuracy, size, and performance

System Administration

  • Server setup and configuration: Production-like infrastructure tasks
  • Building and running Linux from source code: Deep systems knowledge

Domain-Specific Tasks

  • Biology/computational tasks requiring specialized domain knowledge
  • Chess engine move optimization: Algorithm implementation and testing
  • Physics-based rendering: Scientific computing
  • Video processing: Multimedia manipulation
  • Personal assistant tasks: Real-world automation workflows

Difficulty Scaling

Tasks range from easy to hard:

  • Easy tasks: ~65% average accuracy across models
  • Hard tasks: ~16% average accuracy across models

These difficulty labels are author-estimated for humans and may not reflect agent difficulty—some "easy" tasks trip up frontier models while some "hard" tasks fall to clever tool use.


Version 2.0: What Changed and Why

The Quality Problem in Version 1.0

Terminal-Bench 1.0 launched in May 2025 with 80 tasks and became an instant success. But success brought scrutiny. The community—including the researchers themselves—discovered problems:

  • Several tasks were unsolvable for artificial reasons (configuration issues, environment problems)
  • Some set arbitrary thresholds that didn't reflect task completion
  • Others lacked robustness—solutions that worked one day failed the next (the "download-youtube" task became notorious for breaking with YouTube's constantly changing anti-bot protections)

As frontier models climbed above a 50% success rate on version 1.0, the benchmark risked becoming another saturated metric.

The 2.0 Response: Proactive Difficulty Scaling

Instead of waiting for saturation, the team raised the bar in November 2025:

Task Quality and Verification

  • 89 tasks (up from 80) with substantial manual and LM-assisted verification
  • Each task received approximately 3 reviewer-hours of human auditing
  • Tasks verified to be: (1) solvable, (2) realistic, and (3) well-specified
  • Problematic tasks eliminated entirely

Increased Difficulty

  • Version 1.0: Frontier models climbed above 50%
  • Version 2.0: Frontier models scored below 65% at release
  • The benchmark now better represents frontier challenges that distinguish truly capable agents

Technical Infrastructure Upgrade

  • Version 1.0:
    • Used structured outputs to enforce the response schema
    • Ran on EC2 instances using local Docker containers
  • Version 2.0:
    • Does not use structured outputs (invalid or missing JSON responses are retried with a warning)
    • Runs remotely on Daytona managed containers for better scalability
    • Uses the new Harbor framework (a complete rewrite for improved reliability, observability, scalability, and performance)

Specific Improvements

  • Eliminated environment-dependent failures (like the YouTube anti-bot problem)
  • Better separation of task specification from implementation details
  • Improved test coverage and edge case handling
  • Clearer documentation for each task

The Harbor Framework: How Terminal-Bench Actually Runs

Why Harbor Exists

The original Terminal-Bench 1.0 harness worked, but scaling to thousands of evaluations across 16 frontier models and 6 state-of-the-art agents revealed bottlenecks:

  • Local execution on EC2 instances maxed out at 4-6 containers before hitting CPU, memory, and I/O constraints
  • Manual orchestration slowed iteration cycles
  • Limited observability made debugging agent failures difficult
  • No standardized interface for testing custom agents

Harbor is the answer: an open-source framework for evaluating and optimizing agents in container environments.

Harbor Architecture

Core Components:

  • Harbor task format: Standardized specification for agent tasks
  • Harbor harness: Execution engine for running agents against tasks
  • Harbor registry: Centralized task distribution (harbor run -d terminal-bench@2.0)
  • Daytona integration: Managed runtime for scalable, reproducible sandboxes

Supported Agents:

  • Claude Code
  • Codex CLI
  • OpenHands
  • Mini-SWE-Agent
  • Terminus 2 (simple neutral scaffold for comparing raw model performance)

Custom Agent Support: Developers can create custom agents by subclassing BaseInstalledAgent or BaseAgent, receiving the instruction and Docker container, then exploring/manipulating the environment through tool calls (editing files, running Bash commands).

Daytona: From Local to Cloud

The Evolution:

  1. Initial approach: Docker sandboxes on local machines (4-6 containers max)
  2. Problem: CPU, memory, and I/O constraints made it impractical for thousands of tests
  3. Solution: Daytona managed runtime
    • Provisions long-running, reproducible sandboxes at scale
    • Supports thousands of parallel experiments simultaneously
    • For tasks with unique specs, teams submit Dockerfiles to Daytona's Declarative Image Builder
    • Containers built and executed on demand

This infrastructure enabled the 32,155 total trials across models and agents that powered the 2.0 leaderboard.

Security and Contamination Prevention

Sandboxing: Each task runs in isolated Docker containers to prevent cross-contamination and ensure reproducibility.

Contamination Detection: A BIG-bench canary string is included in each repository file to aid training-corpus decontamination. The team acknowledges that private test sets are out of scope given the community investment they would require, but the canary string provides transparency for model developers.


The 2026 Leaderboard: What Models Can (and Can't) Do

Direct Model Performance

As of 2026, the Terminal-Bench 2.0 leaderboard shows:

Top Direct Model Scores:

  • GPT-5.5: 73.20% (leading)
  • Claude Opus 4.7: 68.54%
  • Gemini 3.1 Pro Preview: 67.42%
  • GPT-5.3 Codex: 64.05%
  • Claude Sonnet 4.6 / Muse Spark: 59.55% (tied)

The 65% Ceiling: Despite frontier models approaching human-level performance on many academic benchmarks (MMMU-Pro models within 0.3 points of human experts), they still fail on 18-35% of Terminal-Bench 2.0 tasks—tasks that experienced developers complete routinely.

Agent + Model Combinations

Top Agent Scores (combining agent scaffolding with frontier models):

  • ForgeCode + Claude Opus 4.6: 81.8% (top score)
  • ForgeCode + GPT-5.4: 81.8% (tied for top)
  • TongAgents + Gemini 3.1 Pro: 80.2%
  • SageAgent + GPT-5.3-Codex: 78.4%
  • ForgeCode + Gemini 3.1 Pro: 78.4%

The Agent Scaffolding Effect: The same model can perform very differently with different agent implementations. For example, Gemini 2.5 Pro's pass rate improved 17% with Terminus 2 scaffolding over OpenHands—demonstrating that agent design matters significantly.

Evaluated Agents

The benchmark has tested 6 state-of-the-art agents across 16 frontier models with 32,155 total trials:

  • Terminus 2: Simple neutral scaffold for comparing model performance
  • Claude Code: Anthropic's agent implementation
  • Codex CLI: OpenAI's command-line agent
  • OpenHands: Open-source agentic framework
  • Mini-SWE-Agent: Lightweight software engineering agent
  • ForgeCode, TongAgents, SageAgent: Specialized agent frameworks

Performance by Difficulty

  • Easy tasks: ~65% average accuracy across frontier models
  • Hard tasks: ~16% average accuracy across frontier models

The 49-point gap between easy and hard tasks affects all models uniformly, suggesting that current agent architectures struggle with similar bottlenecks regardless of underlying model capability.


Terminal-Bench vs. Other Agent Benchmarks

Terminal-Bench vs. SWE-bench

SWE-bench (Verified: 500 tasks, Pro: 731 tasks):

  • Focus: GitHub issue resolution and software patch correctness
  • Task: Receive issue description + repo snapshot → produce patch that passes test suite
  • Scoring: Patch must pass issue's associated test suite
  • Domain: Software engineering only
  • 2026 Leaders: Claude (77.2%), GPT-5 (74.9%)

Terminal-Bench (89 tasks):

  • Focus: Operational reliability across diverse terminal tasks
  • Task: Natural language instruction → complete task using Bash commands
  • Scoring: Must pass comprehensive pytest validation
  • Domain: Cross-domain (ML, security, system admin, data science, etc.)
  • 2026 Leaders: GPT-5.5 (73.20% direct), ForgeCode combos (81.8%)

Key Difference: SWE-bench measures software-engineering proficiency; Terminal-Bench measures broader operational capabilities and tool-use accuracy. A model can excel at one and struggle at the other—they test distinct skill sets.

Terminal-Bench vs. GAIA

GAIA (466 tasks):

  • Focus: Multi-step reasoning with diverse tasks
  • Task: Questions requiring multi-step reasoning (e.g., "Find country's GDP and convert currency")
  • Scoring: Answers scored against human-annotated ground truth + LLM judge for paraphrasing
  • Nature: Question-answering
  • 2026 Leaders: Claude Mythos Preview (52.3%), GPT-5.4 Pro (50.5%)

Terminal-Bench:

  • Focus: Action-taking in terminal environments
  • Task: End-to-end task completion via command execution
  • Scoring: Deterministic based on exit codes, file diffs, output strings
  • Nature: Task execution

Key Difference: GAIA tests general-assistant reasoning capability; Terminal-Bench tests action execution in live environments. GAIA asks "Can you figure out the answer?"; Terminal-Bench asks "Can you actually make it happen?"

Performance Variability Across Benchmarks

Models show different strengths across benchmarks. In the same week, a model might achieve:

  • 87% on SWE-bench Verified (software engineering)
  • 44% on GAIA (general reasoning)
  • 73% on Terminal-Bench (operational tasks)

This demonstrates that software-engineering proficiency ≠ general-assistant capability ≠ operational reliability. Each benchmark captures distinct aspects of agent capability.

The Benchmark Gaming Problem

Important Note: Research from Berkeley RDI has shown that Terminal-Bench and other prominent agent benchmarks (including SWE-bench, GAIA, OSWorld, and WebArena) can be exploited to achieve near-perfect scores without genuinely solving tasks:

  • OSWorld: VLM scoring can be manipulated by screenshot interpretation
  • Terminal-Bench: Protected files can sometimes be accessed before sandboxing fully activates
  • WebArena: Reference answers in local JSON files accessible to agents

The Terminal-Bench team is aware of these vulnerabilities and continues to improve sandboxing and verification mechanisms, but this highlights an ongoing challenge in creating truly robust evaluation benchmarks.


Why Terminal-Bench 2.0 Became the Industry Standard

Addressing a Critical Gap

As the announcement post states: "Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models."

Terminal-Bench 2.0 fills this gap by:

  1. Real-world workflow inspiration: Tasks come from actual developer and sysadmin work, not contrived academic problems
  2. Meaningful difficulty: Frontier models still score below 65%, leaving room to measure progress
  3. Deterministic evaluation: No LLM judges, no subjective scoring—just tests that pass or fail
  4. Reproducible infrastructure: Containerization ensures consistent execution across environments

Rapid Industry Adoption

Terminal-Bench 1.0 was a "runaway success"—since its May 2025 launch:

  • Used by virtually every frontier AI lab
  • Became the de facto standard for agent evaluation
  • Cited in model release announcements and research papers
  • Integrated into agent development workflows

Version 2.0 continues this trajectory while addressing quality concerns and raising difficulty before saturation set in.

Quality Commitment

Terminal-Bench 2.0 represents a commitment to maintaining the highest quality evaluation infrastructure as AI agent capabilities increase:

  • ~3 reviewer-hours per task ensures tasks are solvable, realistic, and well-specified
  • Continuous improvement based on community feedback
  • Transparent methodology with open-source Harbor framework
  • Proactive difficulty scaling to prevent saturation

Differentiation Through Scope

While SWE-bench measures software patch correctness and GAIA measures multi-step reasoning with ground-truth answers, Terminal-Bench uniquely evaluates:

  • Operational reliability across diverse terminal tasks
  • Multi-step workflow planning and recovery
  • Tool use in live environments
  • Cross-domain capabilities (not just software engineering)

This broader scope makes Terminal-Bench particularly valuable for teams building general-purpose agents rather than domain-specific tools.


What Terminal-Bench 2.0 Tells Us About AI Agents in 2026

The Capability Ceiling

Frontier models still fail on 18-35% of tasks that humans complete routinely. This is not a transient gap—it reflects fundamental limitations in current agent architectures:

  • Planning under uncertainty: Agents struggle when the next step depends on unknown output
  • Error recovery: When a command fails, agents often retry the same approach or give up
  • Domain knowledge transfer: Expertise in one domain (e.g., Python) doesn't automatically transfer to another (e.g., systems administration)
  • Multi-step reasoning: Chains of 5+ dependent steps show exponential failure rates

The Agent Design Multiplier

Resolution rate correlates strongly with model capability AND agent orchestration. The 17% improvement Gemini 2.5 Pro saw with better scaffolding demonstrates that:

  1. Agent frameworks matter as much as underlying models
  2. Prompt engineering, tool design, and error handling are not solved problems
  3. There's no single "best" agent architecture—different designs excel at different task types

The Real-World Relevance Question

Terminal-Bench 2.0's tasks are inspired by real workflows, but are they representative of what actually matters in production?

Arguments for relevance:

  • Tasks come from actual developer and sysadmin work
  • Cover diverse domains encountered in practice
  • Test end-to-end completion, not isolated skills

Arguments for caution:

  • 89 tasks cannot capture all operational scenarios
  • Containerized environments differ from production messiness
  • Pass/fail scoring misses "good enough" trade-offs

The 37% gap between lab benchmark scores and real-world deployment performance observed across enterprise agentic AI systems suggests that even Terminal-Bench—among the best available benchmarks—cannot fully predict production readiness.

The Saturation Timeline

Version 1.0 saw frontier models climb from roughly 20% at launch (May 2025) to over 50% within six months. Version 2.0 reset the bar, but models are already approaching 73%. If progress continues at this pace:

  • 65% threshold crossed: Already achieved by top models
  • 75% threshold likely by: Mid-2026
  • 85% threshold likely by: Late 2026 to early 2027
  • Saturation (>90%): Potentially mid-2027

The team will likely need Terminal-Bench 3.0 within 12-18 months to maintain differentiation at the frontier.


How to Use Terminal-Bench 2.0 in Your Workflow

Running the Benchmark

Official Distribution via Harbor:

harbor run -d terminal-bench@2.0

This command:

  1. Pulls the Terminal-Bench 2.0 task registry
  2. Provisions Daytona containers for each task
  3. Runs your agent against all 89 tasks
  4. Generates detailed CTRF reports and binary pass/fail results
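
Once a run finishes, the CTRF report can be post-processed with a short script. The sketch below assumes the standard CTRF JSON layout (a top-level results object containing a summary and a tests array); the report file name is an assumption, since Harbor's exact output paths may differ.

# summarize_ctrf.py -- tally results from a CTRF report (illustrative; assumes
# the standard CTRF schema and a report file named report.json)
import json
from pathlib import Path

report = json.loads(Path("report.json").read_text())
summary = report["results"]["summary"]
print(f"{summary['passed']}/{summary['tests']} tests passed")

# List failing tests to guide debugging of the agent run
for test in report["results"]["tests"]:
    if test["status"] == "failed":
        print("FAILED:", test["name"])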

Building Custom Agents

Minimum Implementation (illustrative skeleton; consult the Harbor documentation for exact class and method names):

from harbor import BaseInstalledAgent

class MyAgent(BaseInstalledAgent):
    def solve_task(self, instruction: str, container):
        # Your agent logic here
        # Use container.run_bash(command) to execute commands
        # Parse outputs and decide next steps
        # Return when task complete
        pass

Provided Examples: The supported agents listed earlier (Terminus 2, Claude Code, Codex CLI, OpenHands, Mini-SWE-Agent) double as reference implementations worth studying before writing your own.

Interpreting Results

What a 65% score means:

  • Agent successfully completes 58 of 89 tasks
  • 31 tasks either fail tests or cannot be completed
  • Likely struggles with: hard tasks (16% avg), multi-step error recovery, domain-specific knowledge

What to optimize:

  1. Tool-use reliability: Does your agent correctly parse command outputs?
  2. Error recovery: What happens when a command fails? (see the sketch after this list)
  3. Planning depth: Can your agent handle 5+ dependent steps?
  4. Domain coverage: Does performance cluster in certain task categories?
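
For point 2 in particular, the difference between blind retries and informed fallbacks shows up directly on hard tasks. Here is a minimal, framework-agnostic sketch of the pattern, independent of Harbor or any particular agent:

# Illustrative error-recovery pattern: run a command, and on failure surface the
# reason and try a declared alternative instead of repeating the same call.
import subprocess

def run_with_fallback(primary: list[str], fallback: list[str]) -> str:
    try:
        result = subprocess.run(primary, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        reason = result.stderr.strip()
    except FileNotFoundError:
        reason = f"{primary[0]} is not installed"
    print(f"primary command failed ({reason}); falling back to {' '.join(fallback)}")
    return subprocess.run(fallback, capture_output=True, text=True, check=True).stdout

# Example: prefer ripgrep for a code search, fall back to grep if it is missing
# matches = run_with_fallback(["rg", "-l", "TODO"], ["grep", "-rl", "TODO", "."])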

The Benchmark Ecosystem: Where Terminal-Bench Fits

Complementary Benchmarks

For a complete picture of agent capability, use Terminal-Bench alongside complementary benchmarks such as:

  • SWE-bench: software patch correctness on real GitHub issues
  • GAIA: multi-step, general-assistant reasoning with ground-truth answers
  • OSWorld: GUI-based computer-use tasks
  • WebArena: browser and web-navigation tasks

Each benchmark tests distinct capabilities. High performance on one does not guarantee high performance on others.

When to Use Terminal-Bench

Use Terminal-Bench when:

  • Evaluating command-line proficiency and operational tasks
  • Testing multi-step workflow execution
  • Measuring real-world task completion beyond isolated functions
  • Comparing agent scaffolding designs with controlled model variables

Don't rely solely on Terminal-Bench when:

  • Your use case is GUI-heavy (use OSWorld)
  • You need domain-specific evaluation (use specialized benchmarks like Vals Finance Agent)
  • Production involves long-term collaboration with humans (no benchmark captures this well)

The 37% Lab-to-Production Gap

Research shows that enterprise agentic AI systems exhibit a 37% gap between lab benchmark scores and real-world deployment performance. Terminal-Bench is among the best proxies for production readiness, but:

  • Benchmarks test isolated task completion
  • Production involves messy context, changing requirements, human collaboration
  • Error detectability and correction ease matter as much as success rate

Recommendation: Use Terminal-Bench for relative comparisons between models and agents, but always validate in your specific production context before deployment.


The Future of Terminal-Bench

Likely Evolution

Based on the 1.0 → 2.0 transition, expect:

  1. Terminal-Bench 3.0 within 12-18 months as models approach 85-90%
  2. Harder tasks requiring deeper domain expertise
  3. Interactive tasks where environment state changes dynamically
  4. Adversarial tasks designed to expose specific failure modes

The Benchmark Treadmill

Terminal-Bench faces the same challenge as all benchmarks: saturation is inevitable. The question is not "if" but "when" and "how to respond."

Two strategies:

  1. Proactive difficulty scaling (the 1.0 → 2.0 approach): Raise the bar before saturation
  2. Dynamic task generation: Continuously generate new tasks to create a moving target (like LiveCodeBench for coding)

Terminal-Bench currently uses strategy #1. A shift to strategy #2 might be necessary to maintain relevance beyond 2027.

Community Contribution

Terminal-Bench 2.0 selected 89 tasks from 229 submissions by 93 contributors through crowdsourcing. This community-driven approach:

  • Brings diverse domain expertise
  • Scales task creation beyond core team bandwidth
  • Reflects real-world task diversity

Future versions will likely lean more heavily on community contributions, potentially with:

  • Submission portal for new tasks
  • Peer review process for quality control
  • Bounties for particularly challenging tasks

Bottom Line

Terminal-Bench 2.0 has earned its place as the de facto standard for AI agent evaluation because it tests what actually matters: Can agents complete real-world operational tasks reliably, recover from errors, and execute multi-step workflows across diverse domains?

The answer in 2026 is: mostly, but not entirely. Frontier models score roughly 60-73% when run directly, with agent scaffolding pushing top combinations to 81-82%. This leaves a meaningful capability gap that differentiates systems and reveals failure modes invisible to saturated academic benchmarks.

For teams building agents, Terminal-Bench 2.0 provides:

  • Relative comparison data to choose models and architectures
  • Failure mode visibility to guide optimization
  • Reproducible evaluation infrastructure via Harbor and Daytona
  • Industry-standard reporting for stakeholder communication

But remember: Terminal-Bench scores are proxies, not guarantees. The 37% lab-to-production gap means you still need to validate in your specific context before trusting an agent with production work.

Read the official resources:



Disclosure: This post is editorial commentary on public materials from the Laude Institute, Stanford University, Snorkel AI, and the Terminal-Bench community. For academic or production citations, use the primary paper and official leaderboard data.
