Agents' Last Exam (ALE) is a new benchmark from UC Berkeley that asks a question no academic test has answered convincingly: can AI agents actually do the work human experts perform—not answer trivia about it, but deliver real professional outputs across finance, law, manufacturing, and 52 other subdomains?
The paper, published on June 3, 2026 (arxiv:2606.05405) and submitted to Hugging Face Papers by author Xinyang Han, became the #1 Paper of the Day within a week. The headline result is sobering: on ALE's hardest "Last-Exam" tier, the best frontier agent configurations reach a 2.6% full pass rate, and most score 0%. Codex with GPT-5.5, which scores 82% on Terminal-Bench, passes 0% of Last-Exam tasks.
This post explains what ALE measures, how it was built, what the numbers mean, and why it matters for anyone building or deploying AI agents in 2026.
TL;DR: Agents' Last Exam
| Question | Answer |
|---|---|
| Paper | arxiv:2606.05405 — UC Berkeley, Dawn Song group |
| Task count | 1,490 instances across 55 subfields, 13 industry clusters |
| Task origin | Real projects experts completed on the job (not synthetic) |
| Agent type tested | Generalist Computer-Use Agent (GCUA): GUI + CLI + code + planning |
| Scoring | Deterministic code evaluators against reference artifacts |
| Hardest tier pass rate | 2.6% full pass at best (most configs: 0%) |
| Public tasks | 150 of 1,490 (~10%) to prevent contamination |
| Submit tasks | agents-last-exam.org/submit |
The Gap ALE Targets
AI systems have cleared benchmark after benchmark—Olympiad math, competitive programming, medical licensing exams. Yet economic output in core industries has not transformed at the same pace. The ALE authors call this a utility problem: the field optimizes what it can measure, and what it measures rarely matches long-horizon, economically valuable professional work.
Prior benchmarks trade away realism, breadth, or verifiability:
| Benchmark type | Strength | Limitation |
|---|---|---|
| Terminal-Bench | Real CLI workflows, deterministic scoring | CLI-only; developer/sysadmin focus |
| OSWorld | GUI computer use | GUI-only; shorter tasks |
| SWE-bench | Real GitHub issues | Software engineering only |
| GDPval / Remote Labor Index | Economically valuable work | Human judges required |
| Question-answering suites | Easy to verify | Not workflow execution |
ALE attempts to combine real professional workflows, broad industry coverage, and deterministic verification—all three at once.
What Makes ALE Different
1. Real Origins, Not Synthetic Scenarios
Every task comes from actual projects domain experts completed on the job—work that took days or weeks—not crowdsourced micro-tasks invented by non-experts. Experts submit through a dedicated portal; proposals must specify five components:
- Natural-language description
- Input files
- Target software (the tools professionals actually use)
- Expected deliverable
- Evaluation specification
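The five components above can be pictured as a simple record with a completeness check. This is a minimal sketch, not the portal's actual schema: the `TaskProposal` class, its field names, and `missing_components` are hypothetical, and the example values (including the file name) are illustrative placeholders built around the RPGMaker task the paper cites.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskProposal:
    """Hypothetical container for the five required proposal components."""
    description: str            # natural-language description
    input_files: list[str]      # input files handed to the agent
    target_software: list[str]  # tools professionals actually use
    expected_deliverable: str   # what the agent must produce
    evaluation_spec: str        # how the deliverable will be scored


def missing_components(proposal: TaskProposal) -> list[str]:
    """Return the names of components that were left empty."""
    return [f.name for f in fields(proposal) if not getattr(proposal, f.name)]


# Illustrative values only; the deliverable is left empty on purpose.
proposal = TaskProposal(
    description="Reproduce the game mota.exe using RPGMaker XP",
    input_files=["mota.exe"],
    target_software=["RPGMaker XP"],
    expected_deliverable="",
    evaluation_spec="Compare map geometry, character attributes, event states",
)
print(missing_components(proposal))  # ['expected_deliverable']
```

A proposal with any empty component would be sent back before review under this sketch, which mirrors the requirement that all five parts be specified.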
Example of rejected vs. accepted tasks (from the paper):
| Rejected (too narrow) | Accepted (end-to-end workflow) |
|---|---|
| "Apply a color filter in DaVinci" | "Move a running cheetah into another race video" (tracking, rotoscoping, compositing, color matching) |
| "Design an RPG game with monsters" | "Reproduce the game mota.exe using RPGMaker XP" (verifiable map geometry, character attributes, event states) |
2. O*NET-Grounded Taxonomy
Rather than picking industries ad hoc, ALE maps to O*NET / SOC 2018—the U.S. federal occupational taxonomy. The result: 13 industry clusters, 55 subdomains, covering non-physical work where software-mediated workflows dominate.
The paper notes that even the union of 16 major prior benchmarks leaves 13 of 55 ALE subdomains entirely uncovered.
3. Generalist Computer-Use Agents (GCUA)
ALE tasks routinely interleave GUI interaction (desktop apps, browsers, domain software), CLI operations (shell, code, file manipulation), and web research within a single workflow. The paper defines five agent capability layers:
| Layer | Function | CLI agents | GUI agents | GCUA |
|---|---|---|---|---|
| Brain | LLM reasoning/planning | ✅ | ✅ | ✅ |
| Eyes | GUI perception (screenshots) | ❌ | ✅ | ✅ |
| Body | Orchestration, control flow | ✅ | Shallow | ✅ |
| Hands | Structured tool invocation | ✅ | Narrow | ✅ |
| Feet | Runtime substrate | ✅ | Restricted | ✅ |
Claude Code, Codex, Cursor, and OpenClaw are evaluated as GCUAs by adding a GUI-as-Tool mode: a computer-use-agent (CUA) MCP bridge that exposes 14 desktop-action tools alongside the shell and file tools.
4. Deterministic Scoring (No Human Judges)
Deliverables vary wildly: CAM toolpaths, financial workbooks, 3D meshes, game world states, rendered screenshots. ALE composes scoring from artifact modes:
- Exact / hashed values
- Structured numeric fields with tolerances
- Geometric surface distances
- Behavioral world state under fixed input trajectories
- Free-text rubric scoring (minority of tasks)
LLM-as-judge is rejected at QC unless no deterministic alternative exists—and even then, scoring uses narrow yes/no probes, not holistic "does this look good?" prompts.
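To make two of the artifact modes concrete, here is a minimal sketch of an exact-hash check and a tolerance-based numeric check, composed into a single verdict. Everything in it is an assumption for illustration: the function names, the field names `npv` and `irr`, the 0.1% relative tolerance, and the rule that a full pass requires every check to succeed are not taken from the paper.

```python
import hashlib
import math


def score_exact_hash(artifact: bytes, reference_sha256: str) -> bool:
    """Exact/hashed mode: pass only if the SHA-256 digests match."""
    return hashlib.sha256(artifact).hexdigest() == reference_sha256


def score_numeric(values: dict[str, float], reference: dict[str, float],
                  rel_tol: float = 1e-3) -> bool:
    """Numeric mode: every reference field must exist and be within tolerance."""
    return all(
        key in values and math.isclose(values[key], ref, rel_tol=rel_tol)
        for key, ref in reference.items()
    )


def full_pass(checks: list[bool]) -> bool:
    """Assumed composition rule: all checks must succeed for a full pass."""
    return all(checks)


# Hypothetical reference artifact and values.
reference_bytes = b"quarterly-report-v1"
reference_hash = hashlib.sha256(reference_bytes).hexdigest()

checks = [
    score_exact_hash(b"quarterly-report-v1", reference_hash),
    score_numeric({"npv": 1000.4, "irr": 0.0812}, {"npv": 1000.0, "irr": 0.0812}),
]
print(full_pass(checks))  # True
```

Both checks are pure functions of the artifact and the reference, so repeated runs give the same verdict, which is the property deterministic scoring is after.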
Task Construction Pipeline
Tasks pass five gates before admission:
- Expert sourcing — Advisory committee recruits domain specialists
- Task submission — Experts upload past professional projects via portal
- First-pass review — Conference-style decisions (major/minor revision, accept, strong accept)
- Task implementation — Engineers build VM containers, evaluation logic, dry-runs
- Final QC — Expert committee peer review of reference outputs and evaluation bounds
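The five gates behave like an ordered checklist: a task is admitted only once every gate has been cleared in sequence. A minimal sketch of that idea follows; the `Gate` enum, `gates_cleared`, and `is_admitted` are hypothetical names, and the paper does not describe its tooling at this level.

```python
from enum import IntEnum


class Gate(IntEnum):
    """The five admission gates, in the order listed above."""
    EXPERT_SOURCING = 1
    TASK_SUBMISSION = 2
    FIRST_PASS_REVIEW = 3
    TASK_IMPLEMENTATION = 4
    FINAL_QC = 5


def gates_cleared(results: dict[Gate, bool]) -> list[Gate]:
    """Return gates cleared in order, stopping at the first failure or gap."""
    cleared = []
    for gate in Gate:
        if not results.get(gate, False):
            break
        cleared.append(gate)
    return cleared


def is_admitted(results: dict[Gate, bool]) -> bool:
    """A task is admitted only when all five gates are cleared."""
    return len(gates_cleared(results)) == len(Gate)


# A task that is implemented but still waiting on final QC.
pending = {gate: True for gate in Gate}
pending[Gate.FINAL_QC] = False
print(is_admitted(pending))                       # False
print([g.name for g in gates_cleared(pending)][-1])  # TASK_IMPLEMENTATION
```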
Release strategy: 150 public / 1,017 private / 323 pending QC out of 1,490 total. Private tasks rotate into public over time to prevent pre-training contamination—a living benchmark design.
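As a quick sanity check on the release split, the three pools can be summed and converted to shares. The numbers come from the text above; the variable names are just for illustration.

```python
# Task counts from the release strategy described above.
split = {"public": 150, "private": 1017, "pending_qc": 323}

total = sum(split.values())
assert total == 1490  # the three pools account for every task

# Share of each pool, in percent, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in split.items()}
print(shares)  # {'public': 10.1, 'private': 68.3, 'pending_qc': 21.7}
```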
Results: Three Difficulty Tiers
ALE organizes evaluation into three tiers by cost and difficulty:
| Tier | Tasks | Purpose | Top pass rates |
|---|---|---|---|
| Near-Term | 67 | Cost-effective leaderboard competition | ~38–40% (Codex GPT-5.5) |
| Full-Spectrum | 55 | One task per subdomain for coverage | ~20–24% |
| Last-Exam | 38 | Long-term headroom, milestone evals | 0–2.6% |
Mainstream Agent Results (selected)
| Agent + Model | Near-Term Pass | Full-Spectrum Pass | Last-Exam Pass | Overall Pass |
|---|---|---|---|---|
| Codex (GPT-5.5) | 38.1% | 22.7% | 0.0% | 24.0% |
| ALE-Claw (GPT-5.5) | 32.8% | 23.6% | 2.6% | 23.0% |
| Claude Code (Fable 5) | 34.3% | 20.9% | 0.0% | 22.0% |
| Cursor (GPT-5.5) | 32.1% | 20.0% | 2.6% | 20.7% |
| Cursor (Opus 4.7) | 29.9% | 20.0% | 2.6% | 20.4% |
| Gemini CLI (Gemini 3.1 Pro) | 26.9% | 12.7% | 0.0% | 15.8% |
| Claude Code (Opus 4.8) | 26.9% | 10.9% | 0.0% | 15.8% |
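It helps to translate the Last-Exam percentages back into task counts. With 38 tasks in the tier, the short calculation below shows what each whole number of passed tasks looks like as a rate; the helper name `pass_rate` is only for illustration.

```python
LAST_EXAM_TASKS = 38  # tier size from the table above


def pass_rate(passed: int, total: int) -> float:
    """Pass rate in percent, rounded to one decimal place."""
    return round(100 * passed / total, 1)


for passed in range(3):
    print(passed, pass_rate(passed, LAST_EXAM_TASKS))
# 0 0.0
# 1 2.6
# 2 5.3
```

In other words, a 2.6% Last-Exam score corresponds to a single fully passed task out of 38.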
Key comparison: Codex + GPT-5.5 scores 82% on Terminal-Bench but only 23.3% overall on ALE-CLI (the Linux-only subset)—and 0% on Last-Exam.
Each ALE task run costs $3–10 and takes tens of minutes to hours. Evaluating the full 150-task public set is expensive by design.
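A rough budget follows from the per-run cost. The sketch below assumes the 150 public tasks quoted in the release split, one run per task, and the $3–10 range applying uniformly; the `sweep_cost` helper and its `configs` and `repeats` parameters are hypothetical.

```python
PUBLIC_TASKS = 150          # public set size from the release split
COST_PER_RUN_USD = (3, 10)  # low and high per-run cost quoted above


def sweep_cost(tasks: int, configs: int = 1, repeats: int = 1) -> tuple[int, int]:
    """Return the (low, high) cost in USD for running every task."""
    low, high = COST_PER_RUN_USD
    runs = tasks * configs * repeats
    return runs * low, runs * high


print(sweep_cost(PUBLIC_TASKS))             # (450, 1500)
print(sweep_cost(PUBLIC_TASKS, configs=7))  # (3150, 10500)
```

Under these assumptions, one pass over the public set costs roughly $450 to $1,500 per agent configuration, and comparing seven configurations once each lands between about $3,150 and $10,500.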
What the Failures Look Like
The paper's failure taxonomy for Claude Code + Opus 4.7 runs breaks down root causes across tool types:
- Planning failures — Agent loses track of multi-step workflow state
- GUI errors — Misclicks, wrong dialog interactions, visual misread
- File manipulation — Wrong formats, incomplete deliverables
- Bash/CLI errors — Script failures, environment mismatches
- Web research — Incomplete information gathering
Domain-level scores also vary sharply: computational mathematics and agriculture/environment score highest (~55–85%), while education scores below 25%—likely reflecting both intrinsic model capability gaps and uneven training exposure to specialized professional tools.
ALE vs. Terminal-Bench: Complementary, Not Competing
If you follow agent benchmarks, read our Terminal-Bench 2.0 deep dive—ALE and Terminal-Bench measure different surfaces:
| Dimension | Terminal-Bench 2.0 | Agents' Last Exam |
|---|---|---|
| Primary surface | CLI / terminal | GUI + CLI combined |
| Domain scope | Dev, ML, security, bio (~89 tasks) | 55 professional subdomains (1,490 tasks) |
| Task origin | Curated workflow-inspired | Expert's actual completed projects |
| Best agent score | ~82% (Codex GPT-5.5) | ~24% overall; 2.6% hardest tier |
| Economic framing | Operational reliability | GDP-relevant professional work |
Terminal-Bench tells you whether your agent can operate a terminal. ALE tells you whether it can do someone's job.
Who Should Care
AI lab researchers and eval teams: ALE is the most ambitious attempt yet to bridge benchmark success and economic deployment. If you're optimizing for Terminal-Bench saturation, ALE exposes what terminal mastery doesn't cover.
Enterprise AI buyers: A vendor claiming "90% task automation" on internal benchmarks may score 0% on Last-Exam. Ask which benchmark, which tier, and whether scoring used human judges.
Agent builders (Claude Code, OpenClaw, Cursor users): The GCUA architecture—Brain + Eyes + Body + Hands + Feet in one loop—is where the field is heading. GUI-as-Tool integration is now table stakes for professional workflows.
Policy and labor economists: ALE's O*NET grounding makes it cite-ready for discussions about which occupations face near-term automation pressure vs. which remain human-dominant.
Living Benchmark & Community Participation
ALE is designed to grow continuously:
- Submit tasks: Domain experts can contribute real workflows at agents-last-exam.org/submit
- Rolling evaluation: Private tasks rotate into the public set, and retired public tasks are replaced
- Expert advisory committees per domain ensure ongoing authenticity
The Hugging Face community note from paper author Xinyang Han emphasizes three pillars: real origins, unconstrained method (agents solve however they want, judged on results), and objective scoring (deterministic evaluators only).
Related benchmarks the Semantic Scholar librarian bot flagged: WildClawBench, AgenticVBench, RealClawBench, SWE-Marathon, and TerminalWorld—all probing long-horizon real-world agent evaluation from different angles.
What ALE Does Not Claim
Worth stating clearly:
- ALE covers non-physical, software-mediated industries—not robotics, construction, or clinical procedures requiring physical presence
- 10% public release means leaderboard scores on public tasks may not fully represent private-pool difficulty
- Pass rate ≠ economic replacement rate—a 2.6% pass rate on the hardest tier doesn't mean 97.4% of jobs are safe; it means current agents fail most expert-grade end-to-end deliverables
- Results reflect June 2026 frontier models—the living benchmark will evolve as models improve
Summary
Agents' Last Exam is the most serious attempt yet to evaluate AI agents on economically valuable, long-horizon, real professional work with deterministic scoring. Built from 1,490 task instances across 55 subdomains with 250+ industry experts, grounded in O*NET, and testing Generalist Computer-Use Agents that combine GUI and CLI capabilities, ALE exposes a gap that Terminal-Bench and SWE-bench don't capture.
The 2.6% Last-Exam pass rate is the number to remember. Benchmark saturation elsewhere has not translated into professional workflow mastery. Until agents pass this exam—not answer questions about passing it—GDP-relevant AI automation remains further out than leaderboard hype suggests.
Related Reading
- Mastercard Agent Pay for Machines (AP4M)
- Terminal-Bench 2.0: AI Agent Benchmark Deep Dive
- Claude Code Context Window Management
- What is MCP? Model Context Protocol Guide
- Is OpenClaw Safe? Anthropic Ban Analysis
- Stanford AI Index 2026 Takeaways
Results, task counts, and agent scores cited from arxiv:2606.05405 as of June 2026. Model names and pass rates reflect paper Table 1; verify against upstream for subsequent benchmark updates.