What is Agents' Last Exam (ALE)?

Agents' Last Exam (ALE) is a benchmark from UC Berkeley (arxiv:2606.05405) that evaluates AI agents on long-horizon, economically valuable, real-world professional workflows with verifiable outcomes. It includes 1,490 task instances across 55 subfields in 13 industry clusters, sourced from actual projects completed by 250+ domain experts and mapped to the U.S. O*NET/SOC 2018 occupational taxonomy.

Why is it called Agents' Last Exam?

The name carries a dual meaning. "Last" as competence threshold: passing means an agent can carry out sustained professional work, not just answer questions about it. "Last" as difficulty frontier: tasks are authentic long-horizon workflows at the boundary of what current systems reliably accomplish. Saturating ALE would signal that AI is ready for GDP-relevant industrial deployment.

What pass rate do frontier AI agents achieve on ALE?

On the hardest "Last-Exam" tier (38 tasks), most mainstream agent configurations score 0% full pass rate. The best reported result is 2.6% full pass (Cursor and Droid with GPT-5.5 or Opus 4.7). Codex with GPT-5.5 achieves 24% overall pass rate across all tiers but 0% on Last-Exam. Compare that to 82% on Terminal-Bench for the same Codex configuration.

How is ALE different from Terminal-Bench or SWE-bench?

ALE evaluates Generalist Computer-Use Agents (GCUA) that combine GUI interaction, CLI commands, code, and long-horizon planning in single workflows—making it a superset of GUI-only benchmarks like OSWorld and CLI-only benchmarks like Terminal-Bench. Tasks span 55 professional subdomains (finance, law, manufacturing, education) rather than primarily software engineering. Scoring uses deterministic artifact checks, not human judges.

Who built Agents' Last Exam?

ALE was developed at UC Berkeley in Dawn Song's group. Core contributors include Yiyou Sun, Xinyang Han, and Weichen Zhang, and the full author list exceeds 300 people, spanning data contributors and an advisory committee of industry practitioners. Tasks are submitted by domain experts through agents-last-exam.org and undergo multi-round quality control before admission.

Is ALE a public or private benchmark?

Mostly private. Only 150 of the 1,490 task instances (~10%) are publicly released; the rest are withheld to prevent benchmark contamination from pre-training or task-specific optimization. Private tasks rotate into the public set over time while retired public tasks are replaced. This "living benchmark" design supports rolling evaluation across model generations.

Agents' Last Exam (ALE): Berkeley AI Agent Benchmark | explainx.ai Blog

Agents' Last Exam (ALE) is a new benchmark from UC Berkeley that asks a question no academic test has answered convincingly: can AI agents actually do the work human experts perform—not answer trivia about it, but deliver real professional outputs across finance, law, manufacturing, and 52 other subdomains?

The paper, published on June 3, 2026 (arxiv:2606.05405) and submitted to Hugging Face Papers by author Xinyang Han, became the #1 Paper of the Day within a week. The headline result is sobering: on ALE's hardest "Last-Exam" tier, the best frontier agent configurations reach only a 2.6% full pass rate, and most score 0%. Codex with GPT-5.5, which scores 82% on Terminal-Bench, passes 0% of Last-Exam tasks.

This post explains what ALE measures, how it was built, what the numbers mean, and why it matters for anyone building or deploying AI agents in 2026.

TL;DR: Agents' Last Exam

Question	Answer
Paper	arxiv:2606.05405 — UC Berkeley, Dawn Song group
Task count	1,490 instances across 55 subfields, 13 industry clusters
Task origin	Real projects experts completed on the job (not synthetic)
Agent type tested	Generalist Computer-Use Agent (GCUA): GUI + CLI + code + planning
Scoring	Deterministic code evaluators against reference artifacts
Hardest tier pass rate	2.6% best full pass (most configs: 0%)
Public tasks	150 of 1,490 (~10%) to prevent contamination
Submit tasks	agents-last-exam.org/submit

The Gap ALE Targets

AI systems have cleared benchmark after benchmark—Olympiad math, competitive programming, medical licensing exams. Yet economic output in core industries has not transformed at the same pace. The ALE authors call this a utility problem: the field optimizes what it can measure, and what it measures rarely matches long-horizon, economically valuable professional work.

Prior benchmarks trade away realism, breadth, or verifiability:

Benchmark type	Strength	Limitation
Terminal-Bench	Real CLI workflows, deterministic scoring	CLI-only; developer/sysadmin focus
OSWorld	GUI computer use	GUI-only; shorter tasks
SWE-bench	Real GitHub issues	Software engineering only
GDPval / Remote Labor Index	Economically valuable work	Human judges required
Question-answering suites	Easy to verify	Not workflow execution

ALE attempts to combine real professional workflows, broad industry coverage, and deterministic verification—all three at once.

What Makes ALE Different

1. Real Origins, Not Synthetic Scenarios

Every task comes from actual projects domain experts completed on the job—work that took days or weeks—not crowdsourced micro-tasks invented by non-experts. Experts submit through a dedicated portal; proposals must specify five components:

Natural-language description
Input files
Target software (the tools professionals actually use)
Expected deliverable
Evaluation specification
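As a rough illustration of the five required components, a proposal could be modeled as a small record with a completeness check. This is only a sketch: the class name `TaskProposal`, its field names, and the `missing_components` helper are hypothetical and are not taken from the ALE submission portal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskProposal:
    """Hypothetical record for the five components a proposal must specify."""
    description: str = ""                                      # natural-language description
    input_files: List[str] = field(default_factory=list)       # files handed to the agent
    target_software: List[str] = field(default_factory=list)   # tools professionals actually use
    expected_deliverable: str = ""                             # what the agent must produce
    evaluation_spec: str = ""                                  # how the deliverable is checked

    def missing_components(self) -> List[str]:
        """Return the names of components that are still empty."""
        values = {
            "description": self.description,
            "input_files": self.input_files,
            "target_software": self.target_software,
            "expected_deliverable": self.expected_deliverable,
            "evaluation_spec": self.evaluation_spec,
        }
        return [name for name, value in values.items() if not value]


# A draft that names the task and the software but nothing else yet.
draft = TaskProposal(
    description="Reproduce the game mota.exe using RPGMaker XP",
    target_software=["RPGMaker XP"],
)
print(draft.missing_components())
# ['input_files', 'expected_deliverable', 'evaluation_spec']
```

A check like this would only catch empty fields; whether a task is too narrow, as in the rejected examples, is a judgment the reviewers make.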

Example of rejected vs. accepted tasks (from the paper):

Rejected (too narrow)	Accepted (end-to-end workflow)
"Apply a color filter in DaVinci"	"Move a running cheetah into another race video" (tracking, rotoscoping, compositing, color matching)
"Design an RPG game with monsters"	"Reproduce the game mota.exe using RPGMaker XP" (verifiable map geometry, character attributes, event states)

2. O*NET-Grounded Taxonomy

Rather than picking industries ad hoc, ALE maps to O*NET / SOC 2018—the U.S. federal occupational taxonomy. The result: 13 industry clusters, 55 subdomains, covering non-physical work where software-mediated workflows dominate.

The paper notes that even the union of 16 major prior benchmarks leaves 13 of 55 ALE subdomains entirely uncovered.

3. Generalist Computer-Use Agents (GCUA)

ALE tasks routinely interleave GUI interaction (desktop apps, browsers, domain software), CLI operations (shell, code, file manipulation), and web research within a single workflow. The paper defines five agent capability layers:

Layer	Function	CLI agents	GUI agents	GCUA
Brain	LLM reasoning/planning	✅	✅	✅
Eyes	GUI perception (screenshots)	❌	✅	✅
Body	Orchestration, control flow	✅	Shallow	✅
Hands	Structured tool invocation	✅	Narrow	✅
Feet	Runtime substrate	✅	Restricted	✅

Claude Code, Codex, Cursor, and OpenClaw are evaluated as GCUA by adding GUI-as-Tool mode—a CUA MCP bridge exposing 14 desktop-action tools alongside shell and file tools.
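The layer table can be read as a small lookup: for each agent class, which layers fall short of full support. The sketch below encodes the table as data; the labels "full" and "none" are this example's stand-ins for the check and cross marks, and the function name `gaps` is hypothetical.

```python
# Support level per layer, in the order (CLI agents, GUI agents, GCUA).
AGENT_CLASSES = ("CLI agents", "GUI agents", "GCUA")
LAYERS = {
    "Brain": ("full", "full", "full"),
    "Eyes": ("none", "full", "full"),
    "Body": ("full", "shallow", "full"),
    "Hands": ("full", "narrow", "full"),
    "Feet": ("full", "restricted", "full"),
}


def gaps(agent_class: str) -> dict:
    """Return the layers where an agent class has less than full support."""
    idx = AGENT_CLASSES.index(agent_class)
    return {
        layer: support[idx]
        for layer, support in LAYERS.items()
        if support[idx] != "full"
    }


print(gaps("CLI agents"))  # {'Eyes': 'none'}
print(gaps("GUI agents"))  # {'Body': 'shallow', 'Hands': 'narrow', 'Feet': 'restricted'}
print(gaps("GCUA"))        # {}
```

Read this way, GUI-as-Tool mode is what closes the single gap for CLI agents: it supplies the missing Eyes layer.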

4. Deterministic Scoring (No Human Judges)

Deliverables vary wildly: CAM toolpaths, financial workbooks, 3D meshes, game world states, rendered screenshots. ALE composes scoring from artifact modes:

Exact / hashed values
Structured numeric fields with tolerances
Geometric surface distances
Behavioral world state under fixed input trajectories
Free-text rubric scoring (minority of tasks)

LLM-as-judge is rejected at QC unless no deterministic alternative exists—and even then, scoring uses narrow yes/no probes, not holistic "does this look good?" prompts.
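To make two of the artifact modes concrete, here is a minimal sketch of an exact/hashed check and a numeric-fields-with-tolerance check. It is not ALE's evaluator code: the function names, the tolerance defaults, and the field names `npv` and `irr` are assumptions chosen for illustration.

```python
import hashlib
import math
from typing import Mapping


def exact_hash_match(artifact: bytes, reference_sha256: str) -> bool:
    """Exact mode: the deliverable's SHA-256 must equal the reference digest."""
    return hashlib.sha256(artifact).hexdigest() == reference_sha256


def numeric_fields_match(
    produced: Mapping[str, float],
    reference: Mapping[str, float],
    rel_tol: float = 1e-3,
    abs_tol: float = 1e-9,
) -> bool:
    """Tolerance mode: every reference field must be present and close enough."""
    return all(
        key in produced
        and math.isclose(produced[key], ref, rel_tol=rel_tol, abs_tol=abs_tol)
        for key, ref in reference.items()
    )


reference = {"npv": 1250.0, "irr": 0.082}
print(numeric_fields_match({"npv": 1250.4, "irr": 0.082}, reference))  # True
print(numeric_fields_match({"npv": 1300.0, "irr": 0.082}, reference))  # False
print(numeric_fields_match({"npv": 1250.0}, reference))                # False (missing field)
```

Both checks are deterministic: the same deliverable always gets the same verdict, which is the property that separates this style of scoring from holistic judging.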

Task Construction Pipeline

Tasks pass five gates before admission:

Expert sourcing — Advisory committee recruits domain specialists
Task submission — Experts upload past professional projects via portal
First-pass review — Conference-style decisions (major/minor revision, accept, strong accept)
Task implementation — Engineers build VM containers, evaluation logic, dry-runs
Final QC — Expert committee peer review of reference outputs and evaluation bounds

Release strategy: 150 public / 1,017 private / 323 pending QC out of 1,490 total. Private tasks rotate into public over time to prevent pre-training contamination—a living benchmark design.
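The release split can be checked with a few lines of arithmetic. The counts come from the text above; the dictionary keys are just labels for this example.

```python
release = {"public": 150, "private": 1017, "pending_qc": 323}

total = sum(release.values())
shares = {name: round(100 * count / total, 1) for name, count in release.items()}

print(total)   # 1490
print(shares)  # {'public': 10.1, 'private': 68.3, 'pending_qc': 21.7}
```

The three pools add up to the stated 1,490 instances, and the public share works out to roughly 10%.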

Results: Three Difficulty Tiers

ALE organizes evaluation into three tiers by cost and difficulty:

Tier	Tasks	Purpose	Top pass rates
Near-Term	67	Cost-effective leaderboard competition	~38–40% (Codex GPT-5.5)
Full-Spectrum	55	One task per subdomain for coverage	~20–24%
Last-Exam	38	Long-term headroom, milestone evals	0–2.6%
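It helps to translate the Last-Exam percentages back into task counts. The sketch below assumes a full pass rate is simply solved tasks divided by tier size; how multiple runs are aggregated is not covered here, so treat that as an assumption. The name `pass_rate` is hypothetical.

```python
TIER_SIZES = {"Near-Term": 67, "Full-Spectrum": 55, "Last-Exam": 38}


def pass_rate(solved: int, tier: str) -> float:
    """Percentage of a tier's tasks that were fully passed, to one decimal."""
    return round(100 * solved / TIER_SIZES[tier], 1)


print(pass_rate(0, "Last-Exam"))  # 0.0
print(pass_rate(1, "Last-Exam"))  # 2.6
print(pass_rate(2, "Last-Exam"))  # 5.3
```

Under that assumption, a 2.6% Last-Exam score corresponds to a single fully passed task out of 38.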

Mainstream Agent Results (selected)

Agent + Model	Near-Term Pass	Full-Spectrum Pass	Last-Exam Pass	Overall Pass
Codex (GPT-5.5)	38.1%	22.7%	0.0%	24.0%
ALE-Claw (GPT-5.5)	32.8%	23.6%	2.6%	23.0%
Claude Code (Fable 5)	34.3%	20.9%	0.0%	22.0%
Cursor (GPT-5.5)	32.1%	20.0%	2.6%	20.7%
Cursor (Opus 4.7)	29.9%	20.0%	2.6%	20.4%
Gemini CLI (Gemini 3.1 Pro)	26.9%	12.7%	0.0%	15.8%
Claude Code (Opus 4.8)	26.9%	10.9%	0.0%	15.8%
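One pattern in the table is easier to see programmatically: the configuration with the best overall score is not among those that pass any Last-Exam task. The rows below are copied from the table; the `Row` type and variable names are this example's own.

```python
from typing import NamedTuple


class Row(NamedTuple):
    config: str
    near_term: float
    full_spectrum: float
    last_exam: float
    overall: float


ROWS = [
    Row("Codex (GPT-5.5)", 38.1, 22.7, 0.0, 24.0),
    Row("ALE-Claw (GPT-5.5)", 32.8, 23.6, 2.6, 23.0),
    Row("Claude Code (Fable 5)", 34.3, 20.9, 0.0, 22.0),
    Row("Cursor (GPT-5.5)", 32.1, 20.0, 2.6, 20.7),
    Row("Cursor (Opus 4.7)", 29.9, 20.0, 2.6, 20.4),
    Row("Gemini CLI (Gemini 3.1 Pro)", 26.9, 12.7, 0.0, 15.8),
    Row("Claude Code (Opus 4.8)", 26.9, 10.9, 0.0, 15.8),
]

best_overall = max(ROWS, key=lambda r: r.overall).config
last_exam_passers = [r.config for r in ROWS if r.last_exam > 0]

print(best_overall)
# Codex (GPT-5.5)
print(last_exam_passers)
# ['ALE-Claw (GPT-5.5)', 'Cursor (GPT-5.5)', 'Cursor (Opus 4.7)']
print(best_overall in last_exam_passers)
# False
```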

Key comparison: Codex + GPT-5.5 scores 82% on Terminal-Bench but only 23.3% overall on ALE-CLI (the Linux-only subset)—and 0% on Last-Exam.

Update — July 10, 2026: OpenAI's GPT-5.6 GA thread reports Sol at 53.6 on ALE — +13.1 points above Claude Fable 5 (adaptive), with Terra/Luna beating Fable at ~1/16 cost. See GPT-5.6 vs Fable 5 comparison for rollout context. Thinking Machines argues METR-style solo-autonomy metrics miss collaborative yield. Independent verification pending.

Each ALE task run costs $3–10 and takes tens of minutes to hours. Evaluating the full 152-task public set is expensive by design.
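A back-of-the-envelope estimate shows what "expensive" means here. The sketch assumes one run per task and that the $3–10 range applies uniformly; retries, multiple seeds, and infrastructure overhead are not included. The function name `sweep_cost` is hypothetical.

```python
def sweep_cost(num_tasks: int, low: float = 3.0, high: float = 10.0) -> tuple:
    """Return the (minimum, maximum) dollar cost of one run per task."""
    return num_tasks * low, num_tasks * high


low, high = sweep_cost(152)
print(f"${low:,.0f} to ${high:,.0f} per agent configuration")
# $456 to $1,520 per agent configuration
```

Comparing several agent and model combinations multiplies that range by the number of configurations.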

What the Failures Look Like

The paper's failure taxonomy for Claude Code + Opus 4.7 runs breaks down root causes across tool types:

Planning failures — Agent loses track of multi-step workflow state
GUI errors — Misclicks, wrong dialog interactions, visual misread
File manipulation — Wrong formats, incomplete deliverables
Bash/CLI errors — Script failures, environment mismatches
Web research — Incomplete information gathering

Domain-level scores also vary sharply: computational mathematics and agriculture/environment score highest (~55–85%), while education scores below 25%—likely reflecting both intrinsic model capability gaps and uneven training exposure to specialized professional tools.

ALE vs. Terminal-Bench: Complementary, Not Competing

If you follow agent benchmarks, read our Terminal-Bench 2.0 deep dive—ALE and Terminal-Bench measure different surfaces:

Dimension	Terminal-Bench 2.0	Agents' Last Exam
Primary surface	CLI / terminal	GUI + CLI combined
Domain scope	Dev, ML, security, bio (~89 tasks)	55 professional subdomains (1,490 tasks)
Task origin	Curated workflow-inspired	Experts' actual completed projects
Best agent score	~82% (Codex GPT-5.5)	~24% overall; 2.6% hardest tier
Economic framing	Operational reliability	GDP-relevant professional work

Terminal-Bench tells you whether your agent can operate a terminal. ALE tells you whether it can do someone's job.

Who Should Care

AI lab researchers and eval teams: ALE is the most ambitious attempt yet to bridge benchmark success and economic deployment. If you're optimizing for Terminal-Bench saturation, ALE exposes what terminal mastery doesn't cover.

Enterprise AI buyers: A vendor claiming "90% task automation" on internal benchmarks may score 0% on Last-Exam. Ask which benchmark, which tier, and whether scoring used human judges.

Agent builders (Claude Code, OpenClaw, Cursor users): The GCUA architecture—Brain + Eyes + Body + Hands + Feet in one loop—is where the field is heading. GUI-as-Tool integration is now table stakes for professional workflows.

Policy and labor economists: ALE's O*NET grounding makes it cite-ready for discussions about which occupations face near-term automation pressure vs. which remain human-dominant.

Living Benchmark & Community Participation

ALE is designed to grow continuously:

Submit tasks: Domain experts can contribute real workflows at agents-last-exam.org
Rolling evaluation: Private tasks rotate into the public set; retired public tasks are replaced
Expert advisory committees per domain ensure ongoing authenticity
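The rotation idea can be sketched as a simple swap between two pools. The mechanics here are assumptions: how ALE chooses which tasks to retire or promote is not described above, so this example simply retires the oldest public tasks and promotes the first private ones. The function name `rotate` and the task IDs are hypothetical.

```python
from typing import List, Tuple


def rotate(
    public: List[str], private: List[str], n: int
) -> Tuple[List[str], List[str], List[str]]:
    """Retire n public tasks and promote n private tasks in their place.

    Returns (new_public, new_private, retired). Selection is oldest-first,
    which is an assumption made for this sketch.
    """
    n = min(n, len(public), len(private))
    retired, kept = public[:n], public[n:]
    promoted, remaining = private[:n], private[n:]
    return kept + promoted, remaining, retired


new_public, new_private, retired = rotate(["p1", "p2", "p3"], ["s1", "s2"], 1)
print(new_public)   # ['p2', 'p3', 's1']
print(new_private)  # ['s2']
print(retired)      # ['p1']
```

The public set keeps its size across a rotation, so leaderboard comparisons stay on the same number of tasks even as the task identities change.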

The Hugging Face community note from paper author Xinyang Han emphasizes three pillars: real origins, unconstrained method (agents solve however they want, judged on results), and objective scoring (deterministic evaluators only).

Related benchmarks the Semantic Scholar librarian bot flagged: WildClawBench, AgenticVBench, RealClawBench, SWE-Marathon, and TerminalWorld—all probing long-horizon real-world agent evaluation from different angles.

What ALE Does Not Claim

Worth stating clearly:

ALE covers non-physical, software-mediated industries—not robotics, construction, or clinical procedures requiring physical presence
10% public release means leaderboard scores on public tasks may not fully represent private-pool difficulty
Pass rate ≠ economic replacement rate—a 2.6% pass rate on the hardest tier doesn't mean 97.4% of jobs are safe; it means current agents fail most expert-grade end-to-end deliverables
Results reflect June 2026 frontier models—the living benchmark will evolve as models improve

Summary

Agents' Last Exam is the most serious attempt yet to evaluate AI agents on economically valuable, long-horizon, real professional work with deterministic scoring. Built from 1,490 task instances across 55 subdomains with 250+ industry experts, grounded in O*NET, and testing Generalist Computer-Use Agents that combine GUI and CLI capabilities, ALE exposes a gap that Terminal-Bench and SWE-bench don't capture.

The 2.6% Last-Exam pass rate is the number to remember. Benchmark saturation elsewhere has not translated into professional workflow mastery. Until agents pass this exam—not answer questions about passing it—GDP-relevant AI automation remains further out than leaderboard hype suggests.

Results, task counts, and agent scores cited from arxiv:2606.05405 as of June 2026. Model names and pass rates reflect paper Table 1; verify against upstream for subsequent benchmark updates.

Related posts

Terminal-Bench 2.0: The AI Agent Benchmark That Actually Matters

How to Build Your Own Enterprise AI Benchmark — After Nadella’s Paradox

CoffeeBench: Sakana AI Benchmarks 90-Day LLM Supply Chain Management
