
Terminal-Bench 2.0: The AI Agent Benchmark That Actually Matters

Terminal-Bench 2.0 is the industry-standard benchmark for evaluating AI agents on real-world terminal tasks. 89 carefully curated tasks, Harbor framework, and results from GPT-5.5, Claude Opus 4.7, and more.

ExplainX Team · 18 min read
Tags: AI Benchmarks, AI Agents, Terminal-Bench, Harbor Framework, Agent Evaluation, AI Testing, Software Engineering, Research


Every spring, the AI benchmarking landscape shifts. What was once challenging becomes saturated. What differentiated frontier models becomes statistical noise. But in May 2025, the Laude Institute, Stanford University, and Snorkel AI released something different: Terminal-Bench 1.0—a benchmark that became, in the words of the creators, a "runaway success," adopted by virtually every frontier lab within months.

Six months later, in November 2025, the team released Terminal-Bench 2.0. This wasn't just an incremental update. It was a proactive response to saturation—raising the bar before models conquered version 1.0, while fixing quality issues the community discovered through intensive usage. The result: 89 carefully curated tasks where frontier models still score below 65%, each task receiving approximately 3 reviewer-hours of human auditing to ensure it's solvable, realistic, and well-specified.

This post does four things: it explains what Terminal-Bench 2.0 actually measures, how it differs from other agent benchmarks like SWE-bench and GAIA, how the Harbor framework that powers it works, and what the current leaderboard tells us about the state of AI agents in 2026.


What Terminal-Bench 2.0 Actually Measures

The Core Thesis

Terminal-Bench 2.0 is a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Unlike academic benchmarks that test knowledge recall or isolated code generation, Terminal-Bench measures:

  • Operational reliability in live tool-driven environments
  • Multi-step planning and execution across complex workflows
  • Recovery capabilities when errors occur
  • Real-world task completion ranging from compiling code to training models and setting up servers
  • Tool use ability to operate a computer via terminal autonomously

Each task features a unique environment, a human-written solution, and comprehensive tests for verification. Tasks must be completed using only Bash commands through a headless terminal—no GUI, no shortcuts, no structured output templates to lean on.

Task Structure and Evaluation

Every Terminal-Bench 2.0 task consists of:

  1. Containerized environment initialized with relevant packages and files (all dependencies pinned for reproducibility)
  2. Natural language instruction describing the task to complete
  3. Comprehensive pytest test suite to verify completion
  4. Human-written reference solution manually created to solve the task

Scoring is binary and strict: Pass@1 only. Models must pass ALL pytest validation tests to receive credit for a task. A task with 10 tests where 9 pass still scores 0. There are no multiple attempts, no partial credit, no second chances.
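
To make the all-or-nothing scoring concrete, below is a minimal sketch of what a verification suite might look like for a hypothetical task that asks an agent to produce a CSV report. The file path, column names, and row threshold are invented for illustration and are not taken from the benchmark; real Terminal-Bench suites are task-specific and more thorough, but the pattern is the same: every assertion must pass or the task scores 0.

# test_outputs.py -- hypothetical verification tests for an invented example task
# (illustrative only; not taken from the benchmark)
from pathlib import Path

REPORT = Path("/app/report.csv")  # assumed output location named in the instruction

def test_report_exists():
    assert REPORT.exists(), "agent never produced the report"

def test_header_matches_spec():
    header = REPORT.read_text().splitlines()[0]
    assert header == "user_id,total_spend,last_seen"

def test_report_has_enough_rows():
    data_rows = REPORT.read_text().strip().splitlines()[1:]
    assert len(data_rows) >= 100  # the hypothetical instruction demanded 100+ users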

Verification Process

The evaluation pipeline is deterministic and transparent:

  1. Install uv package manager
  2. Use uvx to install pytest and task-specific dependencies with pinned versions
  3. Run pytest with specific formatting flags
  4. Generate two outputs:
    • Detailed test report (Common Test Reporting Format via pytest-json-ctrf plugin)
    • Single binary success file (1 or 0)

This approach eliminates the ambiguity and gaming vulnerabilities that plague LLM-judged benchmarks.
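
As a rough sketch, the flow above can be mirrored in a few lines of Python. This is not the official harness; the uvx invocation and the pytest-json-ctrf flag spelling are assumptions based on the steps listed, so treat it as illustrative only.

# verify.py -- illustrative sketch of the verification flow described above
# (command and flag names are assumptions, not the official harness)
import subprocess
from pathlib import Path

def run_verification(task_dir: str) -> int:
    # Steps 2-3: run the task's pytest suite through uvx with the CTRF report plugin
    result = subprocess.run(
        ["uvx", "--with", "pytest-json-ctrf", "pytest", "--ctrf", "report.json", "tests/"],
        cwd=task_dir,
    )
    # Step 4: collapse the detailed report into a single binary success signal
    success = 1 if result.returncode == 0 else 0
    Path(task_dir, "success.txt").write_text(str(success))
    return success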


The Task Taxonomy: What Agents Actually Face

Terminal-Bench 2.0 covers diverse domains that reflect real developer and system administrator workflows:

Software Engineering

  • Build systems and compilation: Navigate makefiles, dependency chains, compiler flags
  • Dependency resolution: Install correct package versions, resolve conflicts
  • Git operations with merge conflicts: Real repository manipulation beyond simple commits
  • COBOL-to-Python rewrites: Legacy code migration requiring language understanding
  • Code coverage analysis with gcov: Development tooling proficiency

Security & Cryptography

  • Differential cryptanalysis on cipher systems: Advanced security knowledge
  • Password recovery: Security testing and cracking techniques
  • Vulnerability identification: Code auditing for security flaws
  • API key removal from codebases: Security hygiene and scanning

Machine Learning & Data Science

  • Training fastText models on Yelp data with accuracy/size constraints: Real ML pipelines with competing objectives
  • Neural network framework integration: Model deployment and optimization
  • Model optimization: Balancing accuracy, size, and performance

System Administration

  • Server setup and configuration: Production-like infrastructure tasks
  • Building and running Linux from source code: Deep systems knowledge

Domain-Specific Tasks

  • Biology/computational tasks requiring specialized domain knowledge
  • Chess engine move optimization: Algorithm implementation and testing
  • Physics-based rendering: Scientific computing
  • Video processing: Multimedia manipulation
  • Personal assistant tasks: Real-world automation workflows

Difficulty Scaling

Tasks range from easy to hard:

  • Easy tasks: ~65% average accuracy across models
  • Hard tasks: ~16% average accuracy across models

These difficulty labels are author-estimated for humans and may not reflect agent difficulty—some "easy" tasks trip up frontier models while some "hard" tasks fall to clever tool use.


Version 2.0: What Changed and Why

The Quality Problem in Version 1.0

Terminal-Bench 1.0 launched in May 2025 with 80 tasks and became an instant success. But success brought scrutiny. The community—including the researchers themselves—discovered problems:

  • Several tasks were unsolvable for artificial reasons (configuration issues, environment problems)
  • Some set arbitrary thresholds that didn't reflect task completion
  • Others lacked robustness—solutions that worked one day failed the next (the "download-youtube" task became notorious for breaking with YouTube's constantly changing anti-bot protections)

As frontier models climbed above a 50% success rate on version 1.0, the benchmark risked becoming another saturated metric.

The 2.0 Response: Proactive Difficulty Scaling

Instead of waiting for saturation, the team raised the bar in November 2025:

Task Quality and Verification

  • 89 tasks (up from 80) with substantial manual and LM-assisted verification
  • Each task received approximately 3 reviewer-hours of human auditing
  • Tasks verified to be: (1) solvable, (2) realistic, and (3) well-specified
  • Problematic tasks eliminated entirely

Increased Difficulty

  • Version 1.0: Frontier models climbed above 50%
  • Version 2.0: Frontier models scored below 65% at release
  • The benchmark now better represents frontier challenges that distinguish truly capable agents

Technical Infrastructure Upgrade

  • Version 1.0:
    • Used structured outputs to enforce the response schema
    • Ran on EC2 instances using local Docker containers
  • Version 2.0:
    • Does not use structured outputs (invalid or missing JSON responses are retried with a warning)
    • Runs remotely on Daytona managed containers for better scalability
    • Uses the new Harbor framework (a complete rewrite for improved reliability, observability, scalability, and performance)

Specific Improvements

  • Eliminated environment-dependent failures (like the YouTube anti-bot problem)
  • Better separation of task specification from implementation details
  • Improved test coverage and edge case handling
  • Clearer documentation for each task

The Harbor Framework: How Terminal-Bench Actually Runs

Why Harbor Exists

The original Terminal-Bench 1.0 harness worked, but scaling to thousands of evaluations across 16 frontier models and 6 state-of-the-art agents revealed bottlenecks:

  • Local execution on EC2 instances maxed out at 4-6 containers before hitting CPU, memory, and I/O constraints
  • Manual orchestration slowed iteration cycles
  • Limited observability made debugging agent failures difficult
  • No standardized interface for testing custom agents

Harbor is the answer: an open-source framework for evaluating and optimizing agents in container environments.

Harbor Architecture

Core Components:

  • Harbor task format: Standardized specification for agent tasks
  • Harbor harness: Execution engine for running agents against tasks
  • Harbor registry: Centralized task distribution (harbor run -d terminal-bench@2.0)
  • Daytona integration: Managed runtime for scalable, reproducible sandboxes

Supported Agents:

  • Claude Code
  • Codex CLI
  • OpenHands
  • Mini-SWE-Agent
  • Terminus 2 (simple neutral scaffold for comparing raw model performance)

Custom Agent Support: Developers can create custom agents by subclassing BaseInstalledAgent or BaseAgent, receiving the instruction and Docker container, then exploring/manipulating the environment through tool calls (editing files, running Bash commands).

Daytona: From Local to Cloud

The Evolution:

  1. Initial approach: Docker sandboxes on local machines (4-6 containers max)
  2. Problem: CPU, memory, and I/O constraints made it impractical for thousands of tests
  3. Solution: Daytona managed runtime
    • Provisions long-running, reproducible sandboxes at scale
    • Supports thousands of parallel experiments simultaneously
    • For tasks with unique specs, teams submit Dockerfiles to Daytona's Declarative Image Builder
    • Containers built and executed on demand

This infrastructure enabled the 32,155 total trials across models and agents that powered the 2.0 leaderboard.

Security and Contamination Prevention

Sandboxing: Each task runs in isolated Docker containers to prevent cross-contamination and ensure reproducibility.

Contamination Detection: A BIG-bench canary string is included in each repository file to aid training-corpus decontamination. The team acknowledges that private test sets are out of scope given the community investment they would require, but the canary string provides transparency for model developers.


The 2026 Leaderboard: What Models Can (and Can't) Do

Direct Model Performance

As of 2026, the Terminal-Bench 2.0 leaderboard shows:

Top Direct Model Scores:

  • GPT-5.5: 73.20% (leading)
  • Claude Opus 4.7: 68.54%
  • Gemini 3.1 Pro Preview: 67.42%
  • GPT-5.3 Codex: 64.05%
  • Claude Sonnet 4.6 / Muse Spark: 59.55% (tied)

The 65% Ceiling: Despite frontier models approaching human-level performance on many academic benchmarks (MMMU-Pro models within 0.3 points of human experts), they still fail on 18-35% of Terminal-Bench 2.0 tasks—tasks that experienced developers complete routinely.

Agent + Model Combinations

Top Agent Scores (combining agent scaffolding with frontier models):

  • ForgeCode + Claude Opus 4.6: 81.8% (top score)
  • ForgeCode + GPT-5.4: 81.8% (tied for top)
  • TongAgents + Gemini 3.1 Pro: 80.2%
  • SageAgent + GPT-5.3-Codex: 78.4%
  • ForgeCode + Gemini 3.1 Pro: 78.4%

The Agent Scaffolding Effect: The same model can perform very differently with different agent implementations. For example, Gemini 2.5 Pro's pass rate improved 17% with Terminus 2 scaffolding over OpenHands—demonstrating that agent design matters significantly.

Evaluated Agents

The benchmark has tested 6 state-of-the-art agents across 16 frontier models with 32,155 total trials:

  • Terminus 2: Simple neutral scaffold for comparing model performance
  • Claude Code: Anthropic's agent implementation
  • Codex CLI: OpenAI's command-line agent
  • OpenHands: Open-source agentic framework
  • Mini-SWE-Agent: Lightweight software engineering agent
  • ForgeCode, TongAgents, SageAgent: Specialized agent frameworks

Performance by Difficulty

  • Easy tasks: ~65% average accuracy across frontier models
  • Hard tasks: ~16% average accuracy across frontier models

The 49-point gap between easy and hard tasks affects all models uniformly, suggesting that current agent architectures struggle with similar bottlenecks regardless of underlying model capability.


Terminal-Bench vs. Other Agent Benchmarks

Terminal-Bench vs. SWE-bench

SWE-bench (Verified: 500 tasks, Pro: 731 tasks):

  • Focus: GitHub issue resolution and software patch correctness
  • Task: Receive issue description + repo snapshot → produce patch that passes test suite
  • Scoring: Patch must pass issue's associated test suite
  • Domain: Software engineering only
  • 2026 Leaders: Claude (77.2%), GPT-5 (74.9%)

Terminal-Bench (89 tasks):

  • Focus: Operational reliability across diverse terminal tasks
  • Task: Natural language instruction → complete task using Bash commands
  • Scoring: Must pass comprehensive pytest validation
  • Domain: Cross-domain (ML, security, system admin, data science, etc.)
  • 2026 Leaders: GPT-5.5 (73.20% direct), ForgeCode combos (81.8%)

Key Difference: SWE-bench measures software-engineering proficiency; Terminal-Bench measures broader operational capabilities and tool-use accuracy. A model can excel at one and struggle at the other—they test distinct skill sets.

Terminal-Bench vs. GAIA

GAIA (466 tasks):

  • Focus: Multi-step reasoning with diverse tasks
  • Task: Questions requiring multi-step reasoning (e.g., "Find country's GDP and convert currency")
  • Scoring: Answers scored against human-annotated ground truth + LLM judge for paraphrasing
  • Nature: Question-answering
  • 2026 Leaders: Claude Mythos Preview (52.3%), GPT-5.4 Pro (50.5%)

Terminal-Bench:

  • Focus: Action-taking in terminal environments
  • Task: End-to-end task completion via command execution
  • Scoring: Deterministic based on exit codes, file diffs, output strings
  • Nature: Task execution

Key Difference: GAIA tests general-assistant reasoning capability; Terminal-Bench tests action execution in live environments. GAIA asks "Can you figure out the answer?"; Terminal-Bench asks "Can you actually make it happen?"

Performance Variability Across Benchmarks

Models show different strengths across benchmarks. In the same week, a model might achieve:

  • 87% on SWE-bench Verified (software engineering)
  • 44% on GAIA (general reasoning)
  • 73% on Terminal-Bench (operational tasks)

This demonstrates that software-engineering proficiency ≠ general-assistant capability ≠ operational reliability. Each benchmark captures distinct aspects of agent capability.

The Benchmark Gaming Problem

Important Note: Research from Berkeley RDI has shown that Terminal-Bench and other prominent agent benchmarks (including SWE-bench, GAIA, OSWorld, and WebArena) can be exploited to achieve near-perfect scores without genuinely solving tasks:

  • OSWorld: VLM scoring can be manipulated by screenshot interpretation
  • Terminal-Bench: Protected files can sometimes be accessed before sandboxing fully activates
  • WebArena: Reference answers in local JSON files accessible to agents

The Terminal-Bench team is aware of these vulnerabilities and continues to improve sandboxing and verification mechanisms, but this highlights an ongoing challenge in creating truly robust evaluation benchmarks.


Why Terminal-Bench 2.0 Became the Industry Standard

Addressing a Critical Gap

As the announcement post states: "Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models."

Terminal-Bench 2.0 fills this gap by:

  1. Real-world workflow inspiration: Tasks come from actual developer and sysadmin work, not contrived academic problems
  2. Meaningful difficulty: Frontier models still score below 65%, leaving room to measure progress
  3. Deterministic evaluation: No LLM judges, no subjective scoring—just tests that pass or fail
  4. Reproducible infrastructure: Containerization ensures consistent execution across environments

Rapid Industry Adoption

Terminal-Bench 1.0 was a "runaway success"—since its May 2025 launch:

  • Used by virtually every frontier AI lab
  • Became the de facto standard for agent evaluation
  • Cited in model release announcements and research papers
  • Integrated into agent development workflows

Version 2.0 continues this trajectory while addressing quality concerns and raising difficulty before saturation set in.

Quality Commitment

Terminal-Bench 2.0 represents a commitment to maintaining the highest quality evaluation infrastructure as AI agent capabilities increase:

  • ~3 reviewer-hours per task ensures tasks are solvable, realistic, and well-specified
  • Continuous improvement based on community feedback
  • Transparent methodology with open-source Harbor framework
  • Proactive difficulty scaling to prevent saturation

Differentiation Through Scope

While SWE-bench measures software patch correctness and GAIA measures multi-step reasoning with ground-truth answers, Terminal-Bench uniquely evaluates:

  • Operational reliability across diverse terminal tasks
  • Multi-step workflow planning and recovery
  • Tool use in live environments
  • Cross-domain capabilities (not just software engineering)

This broader scope makes Terminal-Bench particularly valuable for teams building general-purpose agents rather than domain-specific tools.


What Terminal-Bench 2.0 Tells Us About AI Agents in 2026

The Capability Ceiling

Frontier models still fail on 18-35% of tasks that humans complete routinely. This is not a transient gap—it reflects fundamental limitations in current agent architectures:

  • Planning under uncertainty: Agents struggle when the next step depends on unknown output
  • Error recovery: When a command fails, agents often retry the same approach or give up
  • Domain knowledge transfer: Expertise in one domain (e.g., Python) doesn't automatically transfer to another (e.g., systems administration)
  • Multi-step reasoning: Chains of 5+ dependent steps show exponential failure rates

The Agent Design Multiplier

Resolution rate correlates strongly with model capability AND agent orchestration. The 17% improvement Gemini 2.5 Pro saw with better scaffolding demonstrates that:

  1. Agent frameworks matter as much as underlying models
  2. Prompt engineering, tool design, and error handling are not solved problems
  3. There's no single "best" agent architecture—different designs excel at different task types

The Real-World Relevance Question

Terminal-Bench 2.0's tasks are inspired by real workflows, but are they representative of what actually matters in production?

Arguments for relevance:

  • Tasks come from actual developer and sysadmin work
  • Cover diverse domains encountered in practice
  • Test end-to-end completion, not isolated skills

Arguments for caution:

  • 89 tasks cannot capture all operational scenarios
  • Containerized environments differ from production messiness
  • Pass/fail scoring misses "good enough" trade-offs

The 37% gap between lab benchmark scores and real-world deployment performance observed across enterprise agentic AI systems suggests that even Terminal-Bench—among the best available benchmarks—cannot fully predict production readiness.

The Saturation Timeline

Version 1.0 saw frontier models climb from roughly 20% at launch (May 2025) to over 50% within six months. Version 2.0 reset the bar, but models are already approaching 73%. If progress continues at this pace:

  • 65% threshold crossed: Already achieved by top models
  • 75% threshold likely by: Mid-2026
  • 85% threshold likely by: Late 2026 to early 2027
  • Saturation (>90%): Potentially mid-2027

The team will likely need Terminal-Bench 3.0 within 12-18 months to maintain differentiation at the frontier.


How to Use Terminal-Bench 2.0 in Your Workflow

Running the Benchmark

Official Distribution via Harbor:

harbor run -d terminal-bench@2.0

This command:

  1. Pulls the Terminal-Bench 2.0 task registry
  2. Provisions Daytona containers for each task
  3. Runs your agent against all 89 tasks
  4. Generates detailed CTRF reports and binary pass/fail results
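
Once a run finishes, the CTRF report can be post-processed with a short script. The sketch below assumes the standard CTRF JSON layout (a top-level results object containing a summary and a tests array); the report file name is an assumption, since Harbor's exact output paths may differ.

# summarize_ctrf.py -- tally results from a CTRF report (illustrative; assumes
# the standard CTRF schema and a report file named report.json)
import json
from pathlib import Path

report = json.loads(Path("report.json").read_text())
summary = report["results"]["summary"]
print(f"{summary['passed']}/{summary['tests']} tests passed")

# List failing tests to guide debugging of the agent run
for test in report["results"]["tests"]:
    if test["status"] == "failed":
        print("FAILED:", test["name"])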

Building Custom Agents

Minimum Implementation (illustrative skeleton; consult the Harbor documentation for exact class and method names):

from harbor import BaseInstalledAgent

class MyAgent(BaseInstalledAgent):
    def solve_task(self, instruction: str, container):
        # Your agent logic here
        # Use container.run_bash(command) to execute commands
        # Parse outputs and decide next steps
        # Return when task complete
        pass

Provided Examples: The supported agents listed earlier (Terminus 2, Claude Code, Codex CLI, OpenHands, Mini-SWE-Agent) double as reference implementations worth studying before writing your own.

Interpreting Results

What a 65% score means:

  • Agent successfully completes 58 of 89 tasks
  • 31 tasks either fail tests or cannot be completed
  • Likely struggles with: hard tasks (16% avg), multi-step error recovery, domain-specific knowledge

What to optimize:

  1. Tool-use reliability: Does your agent correctly parse command outputs?
  2. Error recovery: What happens when a command fails? (see the sketch after this list)
  3. Planning depth: Can your agent handle 5+ dependent steps?
  4. Domain coverage: Does performance cluster in certain task categories?
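
For point 2 in particular, the difference between blind retries and informed fallbacks shows up directly on hard tasks. Here is a minimal, framework-agnostic sketch of the pattern, independent of Harbor or any particular agent:

# Illustrative error-recovery pattern: run a command, and on failure surface the
# reason and try a declared alternative instead of repeating the same call.
import subprocess

def run_with_fallback(primary: list[str], fallback: list[str]) -> str:
    try:
        result = subprocess.run(primary, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        reason = result.stderr.strip()
    except FileNotFoundError:
        reason = f"{primary[0]} is not installed"
    print(f"primary command failed ({reason}); falling back to {' '.join(fallback)}")
    return subprocess.run(fallback, capture_output=True, text=True, check=True).stdout

# Example: prefer ripgrep for a code search, fall back to grep if it is missing
# matches = run_with_fallback(["rg", "-l", "TODO"], ["grep", "-rl", "TODO", "."])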

The Benchmark Ecosystem: Where Terminal-Bench Fits

Complementary Benchmarks

For a complete picture of agent capability, use Terminal-Bench alongside complementary benchmarks such as:

  • SWE-bench: software patch correctness on real GitHub issues
  • GAIA: multi-step, general-assistant reasoning with ground-truth answers
  • OSWorld: GUI-based computer-use tasks
  • WebArena: browser and web-navigation tasks

Each benchmark tests distinct capabilities. High performance on one does not guarantee high performance on others.

When to Use Terminal-Bench

Use Terminal-Bench when:

  • Evaluating command-line proficiency and operational tasks
  • Testing multi-step workflow execution
  • Measuring real-world task completion beyond isolated functions
  • Comparing agent scaffolding designs with controlled model variables

Don't rely solely on Terminal-Bench when:

  • Your use case is GUI-heavy (use OSWorld)
  • You need domain-specific evaluation (use specialized benchmarks like Vals Finance Agent)
  • Production involves long-term collaboration with humans (no benchmark captures this well)

The 37% Lab-to-Production Gap

Research shows that enterprise agentic AI systems exhibit a 37% gap between lab benchmark scores and real-world deployment performance. Terminal-Bench is among the best proxies for production readiness, but:

  • Benchmarks test isolated task completion
  • Production involves messy context, changing requirements, human collaboration
  • Error detectability and correction ease matter as much as success rate

Recommendation: Use Terminal-Bench for relative comparisons between models and agents, but always validate in your specific production context before deployment.


The Future of Terminal-Bench

Likely Evolution

Based on the 1.0 → 2.0 transition, expect:

  1. Terminal-Bench 3.0 within 12-18 months as models approach 85-90%
  2. Harder tasks requiring deeper domain expertise
  3. Interactive tasks where environment state changes dynamically
  4. Adversarial tasks designed to expose specific failure modes

The Benchmark Treadmill

Terminal-Bench faces the same challenge as all benchmarks: saturation is inevitable. The question is not "if" but "when" and "how to respond."

Two strategies:

  1. Proactive difficulty scaling (the 1.0 → 2.0 approach): Raise the bar before saturation
  2. Dynamic task generation: Continuously generate new tasks to create a moving target (like LiveCodeBench for coding)

Terminal-Bench currently uses strategy #1. A shift to strategy #2 might be necessary to maintain relevance beyond 2027.

Community Contribution

Terminal-Bench 2.0 selected 89 tasks from 229 submissions by 93 contributors through crowdsourcing. This community-driven approach:

  • Brings diverse domain expertise
  • Scales task creation beyond core team bandwidth
  • Reflects real-world task diversity

Future versions will likely lean more heavily on community contributions, potentially with:

  • Submission portal for new tasks
  • Peer review process for quality control
  • Bounties for particularly challenging tasks

Bottom Line

Terminal-Bench 2.0 has earned its place as the de facto standard for AI agent evaluation because it tests what actually matters: Can agents complete real-world operational tasks reliably, recover from errors, and execute multi-step workflows across diverse domains?

The answer in 2026 is: mostly, but not entirely. Frontier models score roughly 60-73% when run directly, with agent scaffolding pushing top combinations to 81-82%. This leaves a meaningful capability gap that differentiates systems and reveals failure modes invisible to saturated academic benchmarks.

For teams building agents, Terminal-Bench 2.0 provides:

  • Relative comparison data to choose models and architectures
  • Failure mode visibility to guide optimization
  • Reproducible evaluation infrastructure via Harbor and Daytona
  • Industry-standard reporting for stakeholder communication

But remember: Terminal-Bench scores are proxies, not guarantees. The 37% lab-to-production gap means you still need to validate in your specific context before trusting an agent with production work.

Read the official resources:



Disclosure: This post is editorial commentary on public materials from the Laude Institute, Stanford University, Snorkel AI, and the Terminal-Bench community. For academic or production citations, use the primary paper and official leaderboard data.
